Open-Meteo maintains an API for historical weather that allows for non-commercial usage of historical weather data maintained by the website.
This file builds on _v001, _v002, _v003, _v004, and _v005 to run exploratory analysis on some historical weather data.
The exploration process uses tidyverse, ranger, several generic custom functions, and several functions specific to Open Meteo processing. First, tidyverse, ranger, and the generic functions are loaded:
library(tidyverse) # tidyverse functionality is included throughout
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ranger) # predict() does not work on ranger objects unless ranger has been called
source("./Generic_Added_Utility_Functions_202105_v001.R") # Basic functions
Next, specific functions written in _v001, _v002, _v003, _v004, and _v005 are sourced:
source("./SimpleOneVar_Functions_202411_v001.R") # Functions for basic single variable analysis
source("./OpenMeteo_NextBest_202411_v001.R") # Functions for finding 'next best' predictor given existing model
source("./OpenMeteo_Functions_202411_v001.R") # Core functions for loading, processing, analysis of Open Meteo
source("./Generic_Analysis_Functions_202411_v001.R") # Additional functions for random forest and related analysis
Key mapping tables for available metrics are also copied:
hourlyMetrics <- "temperature_2m,relativehumidity_2m,dewpoint_2m,apparent_temperature,pressure_msl,surface_pressure,precipitation,rain,snowfall,cloudcover,cloudcover_low,cloudcover_mid,cloudcover_high,shortwave_radiation,direct_radiation,direct_normal_irradiance,diffuse_radiation,windspeed_10m,windspeed_100m,winddirection_10m,winddirection_100m,windgusts_10m,et0_fao_evapotranspiration,weathercode,vapor_pressure_deficit,soil_temperature_0_to_7cm,soil_temperature_7_to_28cm,soil_temperature_28_to_100cm,soil_temperature_100_to_255cm,soil_moisture_0_to_7cm,soil_moisture_7_to_28cm,soil_moisture_28_to_100cm,soil_moisture_100_to_255cm"
dailyMetrics <- "weathercode,temperature_2m_max,temperature_2m_min,apparent_temperature_max,apparent_temperature_min,precipitation_sum,rain_sum,snowfall_sum,precipitation_hours,sunrise,sunset,windspeed_10m_max,windgusts_10m_max,winddirection_10m_dominant,shortwave_radiation_sum,et0_fao_evapotranspiration"
hourlyDescription <- "Air temperature at 2 meters above ground\nRelative humidity at 2 meters above ground\nDew point temperature at 2 meters above ground\nApparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nAtmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.\nTotal precipitation (rain, showers, snow) sum of the preceding hour. Data is stored with a 0.1 mm precision. If precipitation data is summed up to monthly sums, there might be small inconsistencies with the total precipitation amount.\nOnly liquid precipitation of the preceding hour including local showers and rain from large scale systems.\nSnowfall amount of the preceding hour in centimeters. For the water equivalent in millimeter, divide by 7. E.g. 7 cm snow = 10 mm precipitation water equivalent\nTotal cloud cover as an area fraction\nLow level clouds and fog up to 2 km altitude\nMid level clouds from 2 to 6 km altitude\nHigh level clouds from 6 km altitude\nShortwave solar radiation as average of the preceding hour. This is equal to the total global horizontal irradiation\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDirect solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun)\nDiffuse solar radiation as average of the preceding hour\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.\nWind direction at 10 or 100 meters above ground\nWind direction at 10 or 100 meters above ground\nGusts at 10 meters above ground of the indicated hour. Wind gusts in CERRA are defined as the maximum wind gusts of the preceding hour. Please consult the ECMWF IFS documentation for more information on how wind gusts are parameterized in weather models.\nET0 Reference Evapotranspiration of a well watered grass field. Based on FAO-56 Penman-Monteith equations ET0 is calculated from temperature, wind speed, humidity and solar radiation. Unlimited soil water is assumed. ET0 is commonly used to estimate the required irrigation for plants.\nWeather condition as a numeric code. Follow WMO weather interpretation codes. See table below for details. Weather code is calculated from cloud cover analysis, precipitation and snowfall. As barely no information about atmospheric stability is available, estimation about thunderstorms is not possible.\nVapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage temperature of different soil levels below ground.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.\nAverage soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths."
dailyDescription <- "The most severe weather condition on a given day\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily air temperature at 2 meters above ground\nMaximum and minimum daily apparent temperature\nMaximum and minimum daily apparent temperature\nSum of daily precipitation (including rain, showers and snowfall)\nSum of daily rain\nSum of daily snowfall\nThe number of hours with rain\nSun rise and set times\nSun rise and set times\nMaximum wind speed and gusts on a day\nMaximum wind speed and gusts on a day\nDominant wind direction\nThe sum of solar radiaion on a given day in Megajoules\nDaily sum of ET0 Reference Evapotranspiration of a well watered grass field"
# Create tibble for hourly metrics
tblMetricsHourly <- tibble::tibble(metric=hourlyMetrics %>% str_split_1(","),
description=hourlyDescription %>% str_split_1("\n")
)
tblMetricsHourly %>%
print(n=50)
## # A tibble: 33 × 2
## metric description
## <chr> <chr>
## 1 temperature_2m Air temperature at 2 meters above ground
## 2 relativehumidity_2m Relative humidity at 2 meters above ground
## 3 dewpoint_2m Dew point temperature at 2 meters above ground
## 4 apparent_temperature Apparent temperature is the perceived feels-li…
## 5 pressure_msl Atmospheric air pressure reduced to mean sea l…
## 6 surface_pressure Atmospheric air pressure reduced to mean sea l…
## 7 precipitation Total precipitation (rain, showers, snow) sum …
## 8 rain Only liquid precipitation of the preceding hou…
## 9 snowfall Snowfall amount of the preceding hour in centi…
## 10 cloudcover Total cloud cover as an area fraction
## 11 cloudcover_low Low level clouds and fog up to 2 km altitude
## 12 cloudcover_mid Mid level clouds from 2 to 6 km altitude
## 13 cloudcover_high High level clouds from 6 km altitude
## 14 shortwave_radiation Shortwave solar radiation as average of the pr…
## 15 direct_radiation Direct solar radiation as average of the prece…
## 16 direct_normal_irradiance Direct solar radiation as average of the prece…
## 17 diffuse_radiation Diffuse solar radiation as average of the prec…
## 18 windspeed_10m Wind speed at 10 or 100 meters above ground. W…
## 19 windspeed_100m Wind speed at 10 or 100 meters above ground. W…
## 20 winddirection_10m Wind direction at 10 or 100 meters above ground
## 21 winddirection_100m Wind direction at 10 or 100 meters above ground
## 22 windgusts_10m Gusts at 10 meters above ground of the indicat…
## 23 et0_fao_evapotranspiration ET0 Reference Evapotranspiration of a well wat…
## 24 weathercode Weather condition as a numeric code. Follow WM…
## 25 vapor_pressure_deficit Vapor Pressure Deificit (VPD) in kilopascal (k…
## 26 soil_temperature_0_to_7cm Average temperature of different soil levels b…
## 27 soil_temperature_7_to_28cm Average temperature of different soil levels b…
## 28 soil_temperature_28_to_100cm Average temperature of different soil levels b…
## 29 soil_temperature_100_to_255cm Average temperature of different soil levels b…
## 30 soil_moisture_0_to_7cm Average soil water content as volumetric mixin…
## 31 soil_moisture_7_to_28cm Average soil water content as volumetric mixin…
## 32 soil_moisture_28_to_100cm Average soil water content as volumetric mixin…
## 33 soil_moisture_100_to_255cm Average soil water content as volumetric mixin…
# Create tibble for daily metrics
tblMetricsDaily <- tibble::tibble(metric=dailyMetrics %>% str_split_1(","),
description=dailyDescription %>% str_split_1("\n")
)
tblMetricsDaily
## # A tibble: 16 × 2
## metric description
## <chr> <chr>
## 1 weathercode The most severe weather condition on a given day
## 2 temperature_2m_max Maximum and minimum daily air temperature at 2 me…
## 3 temperature_2m_min Maximum and minimum daily air temperature at 2 me…
## 4 apparent_temperature_max Maximum and minimum daily apparent temperature
## 5 apparent_temperature_min Maximum and minimum daily apparent temperature
## 6 precipitation_sum Sum of daily precipitation (including rain, showe…
## 7 rain_sum Sum of daily rain
## 8 snowfall_sum Sum of daily snowfall
## 9 precipitation_hours The number of hours with rain
## 10 sunrise Sun rise and set times
## 11 sunset Sun rise and set times
## 12 windspeed_10m_max Maximum wind speed and gusts on a day
## 13 windgusts_10m_max Maximum wind speed and gusts on a day
## 14 winddirection_10m_dominant Dominant wind direction
## 15 shortwave_radiation_sum The sum of solar radiaion on a given day in Megaj…
## 16 et0_fao_evapotranspiration Daily sum of ET0 Reference Evapotranspiration of …
A previously existing hourly dataset is loaded, with key analysis variables defined in a vector:
# Load previous data
allCityHourly <- readFromRDS("allCity_20241116")
# Get core training variables
varsTrainHourly <- getVarsTrain(allCityHourly)
varsTrainHourly
## [1] "hour" "temperature_2m"
## [3] "relativehumidity_2m" "dewpoint_2m"
## [5] "apparent_temperature" "pressure_msl"
## [7] "surface_pressure" "precipitation"
## [9] "rain" "snowfall"
## [11] "cloudcover" "cloudcover_low"
## [13] "cloudcover_mid" "cloudcover_high"
## [15] "shortwave_radiation" "direct_radiation"
## [17] "direct_normal_irradiance" "diffuse_radiation"
## [19] "windspeed_10m" "windspeed_100m"
## [21] "winddirection_10m" "winddirection_100m"
## [23] "windgusts_10m" "et0_fao_evapotranspiration"
## [25] "weathercode" "vapor_pressure_deficit"
## [27] "soil_temperature_0_to_7cm" "soil_temperature_7_to_28cm"
## [29] "soil_temperature_28_to_100cm" "soil_temperature_100_to_255cm"
## [31] "soil_moisture_0_to_7cm" "soil_moisture_7_to_28cm"
## [33] "soil_moisture_28_to_100cm" "soil_moisture_100_to_255cm"
## [35] "year" "doy"
# Assign default label
keyLabel <- genericKeyLabelOM()
keyLabel
## [1] "predictions based on pre-2022 training data applied to 2022 holdout dataset"
# Get years of data available
allCityHourly %>%
mutate(year=year(date)) %>%
group_by(src) %>%
summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
## src n minYear maxYear minDate maxDate
## <chr> <int> <dbl> <dbl> <date> <date>
## 1 Boston 122712 2010 2023 2010-01-01 2023-12-31
## 2 Chicago 122712 2010 2023 2010-01-01 2023-12-31
## 3 Houston 122712 2010 2023 2010-01-01 2023-12-31
## 4 LA 122712 2010 2023 2010-01-01 2023-12-31
## 5 Miami 122712 2010 2023 2010-01-01 2023-12-31
## 6 NYC 117936 2010 2023 2010-01-01 2023-06-15
## 7 Vegas 122712 2010 2023 2010-01-01 2023-12-31
A daily dataset is created from existing data:
omCityMap <- c("bos"="Boston MA",
"ord"="Chicago IL",
"hou"="Houston TX",
"lax"="Los Angeles CA",
"las"="Las Vegas NV",
"mia"="Miami FL",
"nyc"="New York NY"
)
allCityDaily <- map_dfr(.x=names(omCityMap),
.f=function(x) omRunAllSteps(dlData=FALSE, runStats=FALSE, abbCity=x, returnDF=TRUE),
.id="src"
) %>%
mutate(cityName=unname(omCityMap)[as.integer(src)])
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 71, 73, 73, 3, 3, 3, 1, 71, 1, 0, 3, 3, 2, …
## $ temperature_2m_max <dbl> 3.4, 0.5, -1.5, 0.0, -0.4, -0.7, 1.6, -2.0,…
## $ temperature_2m_min <dbl> -2.4, -5.6, -8.8, -5.0, -5.9, -5.3, -4.9, -…
## $ apparent_temperature_max <dbl> -0.2, -4.1, -8.5, -5.8, -5.2, -6.7, -3.6, -…
## $ apparent_temperature_min <dbl> -6.6, -13.9, -17.0, -10.9, -11.3, -10.9, -1…
## $ precipitation_sum <dbl> 0.5, 8.9, 6.7, 0.0, 0.0, 0.0, 0.0, 0.2, 0.0…
## $ rain_sum <dbl> 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0.49, 5.53, 5.04, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 4, 24, 20, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,…
## $ sunrise <dttm> 2010-01-01 07:13:00, 2010-01-02 07:13:00, …
## $ sunset <dttm> 2010-01-01 16:22:00, 2010-01-02 16:23:00, …
## $ windspeed_10m_max <dbl> 13.1, 34.7, 34.8, 28.1, 17.1, 20.7, 21.1, 2…
## $ windgusts_10m_max <dbl> 23.4, 64.8, 68.0, 50.8, 30.6, 38.9, 38.5, 3…
## $ winddirection_10m_dominant <int> 332, 334, 304, 304, 299, 302, 292, 319, 322…
## $ shortwave_radiation_sum <dbl> 6.58, 3.86, 4.04, 7.66, 7.55, 8.26, 8.82, 6…
## $ et0_fao_evapotranspiration <dbl> 0.66, 0.65, 0.69, 0.87, 0.76, 0.93, 1.01, 0…
## $ sunrise_chr <chr> "2010-01-01T07:13", "2010-01-02T07:13", "20…
## $ sunset_chr <chr> "2010-01-01T16:22", "2010-01-02T16:23", "20…
## $ fct_winddir <fct> 332, 334, 304, 304, 299, 302, 292, 319, 322…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 23,376
## Columns: 21
## $ date <date> 1960-01-01, 1960-01-02, 1960-01-03, 1960-0…
## $ time <chr> "1960-01-01", "1960-01-02", "1960-01-03", "…
## $ weathercode <fct> 3, 51, 3, 2, 3, 1, 3, 3, 51, 51, 61, 63, 53…
## $ temperature_2m_max <dbl> 1.6, 4.9, -0.6, -5.9, -6.1, 1.0, 4.9, 1.6, …
## $ temperature_2m_min <dbl> -3.5, -0.9, -7.5, -11.8, -11.2, -10.3, -3.2…
## $ apparent_temperature_max <dbl> -4.3, -0.6, -6.9, -13.1, -13.6, -4.4, -1.3,…
## $ apparent_temperature_min <dbl> -7.9, -7.0, -14.4, -18.4, -17.7, -17.9, -8.…
## $ precipitation_sum <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ rain_sum <dbl> 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6…
## $ snowfall_sum <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 0, 11, 0, 0, 0, 0, 0, 0, 6, 1, 5, 24, 5, 10…
## $ sunrise <dttm> 1960-01-01 07:18:00, 1960-01-02 07:18:00, …
## $ sunset <dttm> 1960-01-01 16:29:00, 1960-01-02 16:30:00, …
## $ windspeed_10m_max <dbl> 22.7, 27.6, 27.0, 29.1, 29.6, 32.1, 32.7, 3…
## $ windgusts_10m_max <dbl> 40.0, 52.6, 44.6, 50.8, 48.2, 52.6, 56.5, 5…
## $ winddirection_10m_dominant <int> 142, 214, 268, 247, 261, 232, 234, 275, 185…
## $ shortwave_radiation_sum <dbl> 7.45, 2.25, 4.58, 8.66, 9.09, 8.79, 5.86, 8…
## $ et0_fao_evapotranspiration <dbl> 0.95, 0.66, 1.06, 1.06, 1.04, 1.33, 1.23, 1…
## $ sunrise_chr <chr> "1960-01-01T07:18", "1960-01-02T07:18", "19…
## $ sunset_chr <chr> "1960-01-01T16:29", "1960-01-02T16:30", "19…
## $ fct_winddir <fct> 142, 214, 268, 247, 261, 232, 234, 275, 185…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 3, 1, 3, 3, 0, 51, 55, 2, 0, 0, 1, 1, 53, 6…
## $ temperature_2m_max <dbl> 11.8, 12.0, 10.0, 7.6, 8.0, 12.7, 13.4, 0.8…
## $ temperature_2m_min <dbl> 3.9, 0.7, 4.4, 1.8, -1.9, -0.1, -0.2, -3.0,…
## $ apparent_temperature_max <dbl> 8.3, 8.8, 7.2, 4.5, 5.1, 10.9, 12.5, -5.7, …
## $ apparent_temperature_min <dbl> 1.0, -2.6, 1.2, -2.2, -5.5, -3.6, -6.8, -9.…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 1.1, 6.4, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 4, 10, 0, 0, 0, 0, 0, 5, 17,…
## $ sunrise <dttm> 2010-01-01 08:17:00, 2010-01-02 08:17:00, …
## $ sunset <dttm> 2010-01-01 18:33:00, 2010-01-02 18:34:00, …
## $ windspeed_10m_max <dbl> 25.9, 11.2, 14.4, 21.6, 8.8, 17.4, 32.9, 22…
## $ windgusts_10m_max <dbl> 46.8, 23.4, 28.1, 42.1, 20.2, 33.8, 59.0, 4…
## $ winddirection_10m_dominant <int> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
## $ shortwave_radiation_sum <dbl> 11.87, 13.86, 4.76, 13.82, 14.46, 4.55, 6.9…
## $ et0_fao_evapotranspiration <dbl> 1.69, 1.88, 1.03, 1.74, 1.55, 0.91, 1.20, 1…
## $ sunrise_chr <chr> "2010-01-01T08:17", "2010-01-02T08:17", "20…
## $ sunset_chr <chr> "2010-01-01T18:33", "2010-01-02T18:34", "20…
## $ fct_winddir <fct> 350, 89, 65, 344, 15, 126, 345, 345, 353, 7…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 61, 0, …
## $ temperature_2m_max <dbl> 20.1, 23.2, 23.0, 22.1, 22.9, 23.2, 23.3, 2…
## $ temperature_2m_min <dbl> 4.7, 6.7, 6.5, 6.5, 5.0, 7.7, 5.2, 8.4, 7.2…
## $ apparent_temperature_max <dbl> 17.5, 20.6, 19.9, 18.3, 19.8, 19.8, 20.2, 1…
## $ apparent_temperature_min <dbl> 0.9, 2.9, 2.7, 2.3, 0.8, 4.0, 1.6, 5.0, 3.5…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 0…
## $ sunrise <dttm> 2010-01-01 07:59:00, 2010-01-02 08:00:00, …
## $ sunset <dttm> 2010-01-01 17:55:00, 2010-01-02 17:56:00, …
## $ windspeed_10m_max <dbl> 10.3, 13.0, 12.4, 11.4, 11.5, 9.8, 8.8, 12.…
## $ windgusts_10m_max <dbl> 23.0, 32.0, 31.3, 25.9, 24.8, 19.8, 18.4, 3…
## $ winddirection_10m_dominant <int> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
## $ shortwave_radiation_sum <dbl> 8.99, 12.29, 11.80, 10.71, 12.54, 12.24, 12…
## $ et0_fao_evapotranspiration <dbl> 1.97, 2.67, 2.95, 2.74, 2.58, 2.60, 2.31, 2…
## $ sunrise_chr <chr> "2010-01-01T07:59", "2010-01-02T08:00", "20…
## $ sunset_chr <chr> "2010-01-01T17:55", "2010-01-02T17:56", "20…
## $ fct_winddir <fct> 7, 9, 20, 13, 9, 20, 22, 24, 17, 11, 26, 17…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 2, 0, 0, 1, 1, 1, 2, 1, 1, 2, 1, 2, 51, 1, …
## $ temperature_2m_max <dbl> 10.3, 14.2, 14.2, 13.3, 13.6, 15.8, 16.1, 1…
## $ temperature_2m_min <dbl> -1.3, -0.4, 0.7, 2.8, 0.7, 2.5, 6.0, 1.2, 0…
## $ apparent_temperature_max <dbl> 6.3, 10.7, 9.7, 9.1, 9.7, 12.0, 12.0, 6.6, …
## $ apparent_temperature_min <dbl> -4.9, -3.6, -2.8, -0.8, -3.2, -0.9, 3.0, -2…
## $ precipitation_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ rain_sum <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0…
## $ sunrise <dttm> 2010-01-01 07:51:00, 2010-01-02 07:52:00, …
## $ sunset <dttm> 2010-01-01 17:36:00, 2010-01-02 17:37:00, …
## $ windspeed_10m_max <dbl> 8.4, 5.4, 9.7, 8.2, 6.4, 6.3, 13.7, 9.7, 5.…
## $ windgusts_10m_max <dbl> 18.0, 16.6, 24.1, 22.0, 18.7, 16.6, 32.0, 2…
## $ winddirection_10m_dominant <int> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
## $ shortwave_radiation_sum <dbl> 10.98, 11.85, 11.72, 11.04, 11.88, 11.74, 1…
## $ et0_fao_evapotranspiration <dbl> 1.55, 1.62, 1.91, 1.84, 1.78, 1.80, 2.22, 1…
## $ sunrise_chr <chr> "2010-01-01T07:51", "2010-01-02T07:52", "20…
## $ sunset_chr <chr> "2010-01-01T17:36", "2010-01-02T17:37", "20…
## $ fct_winddir <fct> 327, 332, 341, 338, 317, 311, 3, 3, 317, 34…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 5,113
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 53, 1, 51, 3, 3, 1, 1, 2, 61, 3, 2, 3, 1, 3…
## $ temperature_2m_max <dbl> 26.6, 18.0, 16.7, 15.5, 14.9, 13.8, 16.6, 2…
## $ temperature_2m_min <dbl> 17.5, 11.6, 11.3, 9.0, 9.4, 6.3, 8.6, 11.6,…
## $ apparent_temperature_max <dbl> 25.7, 13.8, 13.6, 10.8, 10.8, 9.2, 14.3, 20…
## $ apparent_temperature_min <dbl> 14.5, 6.9, 6.6, 4.0, 5.0, 0.3, 3.9, 10.2, 1…
## $ precipitation_sum <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ rain_sum <dbl> 2.3, 0.0, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 6.6…
## $ snowfall_sum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ precipitation_hours <dbl> 6, 0, 4, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, …
## $ sunrise <dttm> 2010-01-01 08:07:00, 2010-01-02 08:07:00, …
## $ sunset <dttm> 2010-01-01 18:41:00, 2010-01-02 18:42:00, …
## $ windspeed_10m_max <dbl> 31.3, 29.2, 22.6, 21.0, 23.7, 27.4, 18.6, 1…
## $ windgusts_10m_max <dbl> 50.4, 46.8, 32.4, 36.7, 41.0, 42.1, 29.5, 2…
## $ winddirection_10m_dominant <int> 263, 350, 343, 324, 321, 329, 345, 247, 334…
## $ shortwave_radiation_sum <dbl> 11.06, 15.60, 12.41, 16.22, 14.96, 16.69, 1…
## $ et0_fao_evapotranspiration <dbl> 2.80, 3.50, 3.54, 3.76, 2.96, 3.13, 3.27, 2…
## $ sunrise_chr <chr> "2010-01-01T08:07", "2010-01-02T08:07", "20…
## $ sunset_chr <chr> "2010-01-01T18:41", "2010-01-02T18:42", "20…
## $ fct_winddir <fct> 263, 350, 343, 324, 321, 329, 345, 247, 334…
##
## Objects in JSON include: latitude, longitude, generationtime_ms, utc_offset_seconds, timezone, timezone_abbreviation, elevation, daily_units, daily
##
## Rows: 4,914
## Columns: 21
## $ date <date> 2010-01-01, 2010-01-02, 2010-01-03, 2010-0…
## $ time <chr> "2010-01-01", "2010-01-02", "2010-01-03", "…
## $ weathercode <fct> 73, 71, 71, 1, 1, 2, 2, 71, 0, 0, 2, 2, 3, …
## $ temperature_2m_max <dbl> 5.0, -0.6, -4.8, -0.8, -0.2, 1.1, 3.6, 1.9,…
## $ temperature_2m_min <dbl> -1.4, -9.2, -10.0, -7.3, -7.3, -5.3, -3.7, …
## $ apparent_temperature_max <dbl> 2.2, -6.0, -12.5, -6.6, -6.0, -5.0, -1.2, -…
## $ apparent_temperature_min <dbl> -5.6, -16.9, -17.6, -13.5, -12.4, -10.8, -9…
## $ precipitation_sum <dbl> 1.8, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0…
## $ rain_sum <dbl> 0.3, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0…
## $ snowfall_sum <dbl> 1.05, 0.70, 0.28, 0.00, 0.00, 0.00, 0.00, 0…
## $ precipitation_hours <dbl> 5, 4, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0…
## $ sunrise <dttm> 2010-01-01 08:19:00, 2010-01-02 08:19:00, …
## $ sunset <dttm> 2010-01-01 17:39:00, 2010-01-02 17:40:00, …
## $ windspeed_10m_max <dbl> 11.0, 27.2, 35.3, 20.5, 17.8, 20.2, 18.1, 2…
## $ windgusts_10m_max <dbl> 18.4, 53.3, 61.2, 43.6, 34.9, 39.2, 32.8, 3…
## $ winddirection_10m_dominant <int> 297, 307, 297, 301, 296, 301, 295, 292, 320…
## $ shortwave_radiation_sum <dbl> 5.34, 6.81, 5.64, 8.88, 9.33, 8.95, 9.53, 5…
## $ et0_fao_evapotranspiration <dbl> 0.56, 1.06, 1.09, 1.13, 1.02, 1.18, 1.11, 0…
## $ sunrise_chr <chr> "2010-01-01T08:19", "2010-01-02T08:19", "20…
## $ sunset_chr <chr> "2010-01-01T17:39", "2010-01-02T17:40", "20…
## $ fct_winddir <fct> 297, 307, 297, 301, 296, 301, 295, 292, 320…
allCityDaily %>%
mutate(year=year(date)) %>%
group_by(cityName) %>%
summarize(n=n(), minYear=min(year), maxYear=max(year), minDate=min(date), maxDate=max(date))
## # A tibble: 7 × 6
## cityName n minYear maxYear minDate maxDate
## <chr> <int> <dbl> <dbl> <date> <date>
## 1 Boston MA 5113 2010 2023 2010-01-01 2023-12-31
## 2 Chicago IL 23376 1960 2023 1960-01-01 2023-12-31
## 3 Houston TX 5113 2010 2023 2010-01-01 2023-12-31
## 4 Las Vegas NV 5113 2010 2023 2010-01-01 2023-12-31
## 5 Los Angeles CA 5113 2010 2023 2010-01-01 2023-12-31
## 6 Miami FL 5113 2010 2023 2010-01-01 2023-12-31
## 7 New York NY 4914 2010 2023 2010-01-01 2023-06-15
Hourly data is updated with city name to match daily data:
hourlyCityMap <- c("Boston"="Boston MA",
"Chicago"="Chicago IL",
"Houston"="Houston TX",
"LA"="Los Angeles CA",
"Vegas"="Las Vegas NV",
"Miami"="Miami FL",
"NYC"="New York NY"
)
allCityHourly <- allCityHourly %>%
mutate(cityName=hourlyCityMap[src])
allCityHourly %>%
group_by(src, cityName) %>%
summarize(n=n(), minDate=min(date), maxDate=max(date), .groups="drop")
## # A tibble: 7 × 5
## src cityName n minDate maxDate
## <chr> <chr> <int> <date> <date>
## 1 Boston Boston MA 122712 2010-01-01 2023-12-31
## 2 Chicago Chicago IL 122712 2010-01-01 2023-12-31
## 3 Houston Houston TX 122712 2010-01-01 2023-12-31
## 4 LA Los Angeles CA 122712 2010-01-01 2023-12-31
## 5 Miami Miami FL 122712 2010-01-01 2023-12-31
## 6 NYC New York NY 117936 2010-01-01 2023-06-15
## 7 Vegas Las Vegas NV 122712 2010-01-01 2023-12-31
Alignment of maximum temperature is explored:
maxTDelta <- allCityHourly %>%
group_by(cityName, date) %>%
summarize(maxT=max(temperature_2m), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
mutate(delta=maxT-temperature_2m_max)
maxTDelta %>%
summary()
## cityName date maxT temperature_2m_max
## Length:33236 Min. :2010-01-01 Min. :-22.50 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 14.70 1st Qu.: 14.70
## Mode :character Median :2016-07-01 Median : 23.60 Median : 23.60
## Mean :2016-07-01 Mean : 21.34 Mean : 21.33
## 3rd Qu.:2019-10-01 3rd Qu.: 28.80 3rd Qu.: 28.80
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :-2.700000
## 1st Qu.: 0.000000
## Median : 0.000000
## Mean : 0.005277
## 3rd Qu.: 0.000000
## Max. : 5.900000
maxTDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.01))
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.104
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
10% of observations in Chicago differ, potentially due to time zone conversions. All other cities align
The differences in Chicago observations are further explored:
maxTDelta %>%
filter(cityName=="Chicago IL") %>%
mutate(mon=month(date), year=year(date), isDiff=abs(delta)>0.01) %>%
group_by(year, mon) %>%
summarize(pctDiff=mean(isDiff), n=n(), .groups="drop") %>%
ggplot(aes(x=factor(month.abb[mon], levels=month.abb), y=pctDiff, group=1)) +
geom_line() +
geom_hline(data=~summarize(., yint=sum(n*pctDiff)/sum(n)), aes(yintercept=yint), lty=2) +
facet_wrap(~year) +
labs(x=NULL, y="Proportion mismatched", title="Mismatches between daily and hourly maximum temperature")
In general, mismatches seem to be more common in the colder season
A sample of the larger mismatches are explored:
tmpDates <- maxTDelta %>%
filter(cityName=="Chicago IL") %>%
mutate(mon=month(date), year=year(date), isDiff=abs(delta)>2.5) %>%
filter(isDiff) %>%
pull(date)
tmpDates
## [1] "2013-01-30" "2013-04-17" "2013-12-05" "2016-04-04" "2017-12-05"
## [6] "2018-05-20" "2019-01-20" "2020-12-24" "2022-02-17"
tmpDF <- allCityHourly %>%
filter(cityName=="Chicago IL") %>%
select(cityName, time, date, hour, temperature_2m)
map_dfr(.x=tmpDates,
.f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)),
.id="src"
) %>%
mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
ggplot(aes(x=time, y=temperature_2m)) +
geom_line() +
geom_line(aes(y=temperature_2m_max), lty=2) +
facet_wrap(~src, scales="free") +
labs(x=NULL, y="Temperature (C)", title="Hourly temperatures near key disconnect dates")
Disconnects appear to arise when the maximum temperature is reached very early and/or very late in the day. This is consistent with time zones potentially driving the differences.
Shading is added for clarity:
map_dfr(.x=tmpDates,
.f=function(x) tmpDF %>% filter(date %in% c(x, x-1, x+1)),
.id="src"
) %>%
mutate(src=tmpDates[as.integer(src)], isSame=(src==date)) %>%
left_join(select(maxTDelta, cityName, src=date, temperature_2m_max), by=c("cityName", "src")) %>%
ggplot(aes(x=time, y=temperature_2m)) +
geom_line() +
geom_line(aes(y=temperature_2m_max), lty=2) +
facet_wrap(~src, scales="free") +
labs(x=NULL,
y="Temperature (C)",
title="Hourly temperatures near key disconnect dates",
subtitle="Line is hourly data, shaded region is the 24 hours in the key date",
caption="Dotted line is maximum temperature reported in daily data"
) +
geom_tile(data=~filter(., isSame), fill="lightblue", aes(x=time, y=0, height=+Inf), alpha=0.5)
Chicago data are shifted by an hour to check whether time zones fully explain the differences:
# Run previously
# tmpDF <- allCityHourly %>%
# filter(cityName=="Chicago IL") %>%
# select(cityName, time, date, hour, temperature_2m)
tmpDF %>%
mutate(newtime=time-hours(1), date=date(newtime)) %>%
filter(!(date %in% c(min(date)))) %>%
group_by(cityName, date) %>%
summarize(maxT=max(temperature_2m), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, temperature_2m_max), by=c("cityName", "date")) %>%
mutate(delta=maxT-temperature_2m_max) %>%
summary()
## cityName date maxT temperature_2m_max
## Length:4748 Min. :2010-01-01 Min. :-22.80 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 4.90 1st Qu.: 4.90
## Mode :character Median :2016-07-01 Median : 14.90 Median : 14.90
## Mean :2016-07-01 Mean : 14.28 Mean : 14.28
## 3rd Qu.:2019-10-01 3rd Qu.: 24.20 3rd Qu.: 24.20
## Max. :2022-12-31 Max. : 37.40 Max. : 37.40
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
If date is recalculated based on time minus one hour, the Chicago hourly data aligns with the Chicago daily data for maximum temperature
Alignment of precipitation is explored, first with time zones in hourly data “as is”:
sumPDelta <- allCityHourly %>%
group_by(cityName, date) %>%
summarize(sumP=sum(precipitation), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
mutate(delta=sumP-precipitation_sum)
sumPDelta %>%
summary()
## cityName date sumP precipitation_sum
## Length:33236 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.575 Mean : 2.575
## 3rd Qu.:2019-10-01 3rd Qu.: 1.500 3rd Qu.: 1.600
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-14.7
## 1st Qu.: 0.0
## Median : 0.0
## Mean : 0.0
## 3rd Qu.: 0.0
## Max. : 14.7
sumPDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.259
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
25% of observations in Chicago differ, likely due to time zone conversions. All other cities align
Chicago data is updated to be one hour earlier:
sumPDelta <- allCityHourly %>%
mutate(date=as.Date(ifelse(cityName %in% c("Chicago IL"), time-hours(1), date))) %>%
group_by(cityName, date) %>%
summarize(sumP=sum(precipitation), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(allCityDaily, cityName, date, precipitation_sum), by=c("cityName", "date")) %>%
mutate(delta=sumP-precipitation_sum)
sumPDelta %>%
summary()
## cityName date sumP precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
sumPDelta %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>0.005))
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
After adjusting for time zone in Chicago, all observations align
The process is generalized in a function that adjusts only Chicago and with other significant hard-coding:
tmpDailyVsHourly <- function(df1=allCityHourly,
df2=allCityDaily,
chgCities=c("Chicago IL"),
chgHours=-1,
varDF1="precipitation",
varDF2="precipitation_sum",
fn=sum,
eqTol=0.005
) {
df <- df1 %>%
mutate(date=as.Date(ifelse(cityName %in% chgCities, time+hours(chgHours), date))) %>%
group_by(cityName, date) %>%
summarize(across(.cols=all_of(varDF1), .fns=fn, .names="agg"), .groups="drop") %>%
filter(year(date)<=2022) %>%
left_join(select(df2, all_of(c("cityName", "date", varDF2))),
by=c("cityName", "date")
) %>%
mutate(delta=agg-get(varDF2))
df %>%
summary() %>%
print()
df %>%
group_by(cityName) %>%
summarize(pctDiff=mean(abs(delta)>eqTol)) %>%
print()
}
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
## cityName date agg precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for maximum temperature:
# No change in time
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005, chgCities=c())
## cityName date agg temperature_2m_max
## Length:33236 Min. :2010-01-01 Min. :-22.50 Min. :-22.80
## Class :character 1st Qu.:2013-04-01 1st Qu.: 14.70 1st Qu.: 14.70
## Mode :character Median :2016-07-01 Median : 23.60 Median : 23.60
## Mean :2016-07-01 Mean : 21.34 Mean : 21.33
## 3rd Qu.:2019-10-01 3rd Qu.: 28.80 3rd Qu.: 28.80
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :-2.700000
## 1st Qu.: 0.000000
## Median : 0.000000
## Mean : 0.005277
## 3rd Qu.: 0.000000
## Max. : 5.900000
## # A tibble: 7 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Chicago IL 0.104
## 3 Houston TX 0
## 4 Las Vegas NV 0
## 5 Los Angeles CA 0
## 6 Miami FL 0
## 7 New York NY 0
# Chicago shifted by -1 hours (default)
tmpDailyVsHourly(varDF1="temperature_2m", varDF2="temperature_2m_max", fn=max, eqTol=0.005)
## cityName date agg temperature_2m_max
## Length:28488 Min. :2010-01-01 Min. :-15.40 Min. :-15.40
## Class :character 1st Qu.:2013-04-01 1st Qu.: 16.50 1st Qu.: 16.50
## Mode :character Median :2016-07-01 Median : 24.60 Median : 24.60
## Mean :2016-07-01 Mean : 22.51 Mean : 22.51
## 3rd Qu.:2019-10-01 3rd Qu.: 29.30 3rd Qu.: 29.30
## Max. :2022-12-31 Max. : 45.90 Max. : 45.90
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for maximum and minimum apparent temperature:
# Maximum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_max", fn=max, eqTol=0.005)
## cityName date agg
## Length:28488 Min. :2010-01-01 Min. :-23.60
## Class :character 1st Qu.:2013-04-01 1st Qu.: 13.90
## Mode :character Median :2016-07-01 Median : 24.30
## Mean :2016-07-01 Mean : 22.04
## 3rd Qu.:2019-10-01 3rd Qu.: 31.40
## Max. :2022-12-31 Max. : 45.60
## apparent_temperature_max delta
## Min. :-23.60 Min. :0
## 1st Qu.: 13.90 1st Qu.:0
## Median : 24.30 Median :0
## Mean : 22.04 Mean :0
## 3rd Qu.: 31.40 3rd Qu.:0
## Max. : 45.60 Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
# Minimum apparent temperature, with Chicago time shift
tmpDailyVsHourly(varDF1="apparent_temperature", varDF2="apparent_temperature_min", fn=min, eqTol=0.005)
## cityName date agg
## Length:28488 Min. :2010-01-01 Min. :-33.10
## Class :character 1st Qu.:2013-04-01 1st Qu.: 3.30
## Mode :character Median :2016-07-01 Median : 12.90
## Mean :2016-07-01 Mean : 12.12
## 3rd Qu.:2019-10-01 3rd Qu.: 22.00
## Max. :2022-12-31 Max. : 33.20
## apparent_temperature_min delta
## Min. :-33.10 Min. :0
## 1st Qu.: 3.30 1st Qu.:0
## Median : 12.90 Median :0
## Mean : 12.12 Mean :0
## 3rd Qu.: 22.00 3rd Qu.:0
## Max. : 33.20 Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for sum of precipitation and snowfall:
# Precipitation, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_sum", fn=sum, eqTol=0.005)
## cityName date agg precipitation_sum
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.499 Mean : 2.499
## 3rd Qu.:2019-10-01 3rd Qu.: 1.400 3rd Qu.: 1.400
## Max. :2022-12-31 Max. :143.900 Max. :143.900
## delta
## Min. :-1.421e-14
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 9.273e-18
## 3rd Qu.: 0.000e+00
## Max. : 7.105e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
# Snowfall, with Chicago time shift
tmpDailyVsHourly(varDF1="snowfall", varDF2="snowfall_sum", fn=sum, eqTol=0.005)
## cityName date agg snowfall_sum
## Length:28488 Min. :2010-01-01 Min. : 0.00000 Min. : 0.00000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.00000 1st Qu.: 0.00000
## Mode :character Median :2016-07-01 Median : 0.00000 Median : 0.00000
## Mean :2016-07-01 Mean : 0.08226 Mean : 0.08226
## 3rd Qu.:2019-10-01 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :2022-12-31 Max. :32.83000 Max. :32.83000
## delta
## Min. :-3.553e-15
## 1st Qu.: 0.000e+00
## Median : 0.000e+00
## Mean : 1.270e-18
## 3rd Qu.: 0.000e+00
## Max. : 3.553e-15
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
The function is tested for hours of precipitation:
# Precipitation hours, with Chicago time shift
tmpDailyVsHourly(varDF1="precipitation", varDF2="precipitation_hours", fn=function(x) sum(x>0), eqTol=0.005)
## cityName date agg precipitation_hours
## Length:28488 Min. :2010-01-01 Min. : 0.000 Min. : 0.000
## Class :character 1st Qu.:2013-04-01 1st Qu.: 0.000 1st Qu.: 0.000
## Mode :character Median :2016-07-01 Median : 0.000 Median : 0.000
## Mean :2016-07-01 Mean : 2.992 Mean : 2.992
## 3rd Qu.:2019-10-01 3rd Qu.: 4.000 3rd Qu.: 4.000
## Max. :2022-12-31 Max. :24.000 Max. :24.000
## delta
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## # A tibble: 6 × 2
## cityName pctDiff
## <chr> <dbl>
## 1 Boston MA 0
## 2 Houston TX 0
## 3 Las Vegas NV 0
## 4 Los Angeles CA 0
## 5 Miami FL 0
## 6 New York NY 0
A random forest model is created to estimate dewpoint based on temperature and apparent temperature given city and month:
# Use only 10% as training data
set.seed(25091014)
idxTrain_v1 <- sample(1:nrow(allCityHourly), round(0.1*nrow(allCityHourly)), replace=FALSE)
rfTestDewP <- runFullRF(dfTrain=allCityHourly[idxTrain_v1, ] %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
) %>%
select(dewpoint_2m, temperature_2m, apparent_temperature, fct_city, fct_mon),
dfTest=allCityHourly[-idxTrain_v1, ] %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
),
yVar="dewpoint_2m",
xVars=c("temperature_2m", "apparent_temperature", "fct_city", "fct_mon"),
isContVar=TRUE,
rndTo=-1L,
returnData=TRUE
)
##
## R-squared of test data is: 93.31% (RMSE 2.84 vs. 10.98 null)
## `geom_smooth()` using formula = 'y ~ x'
rfTestDewP$tstPred %>%
group_by(fct_city, fct_mon) %>%
summarize(n=n(),
tssActual=sum((dewpoint_2m-mean(dewpoint_2m))**2),
rssPred=sum((dewpoint_2m-pred)**2),
r2=1-rssPred/tssActual,
rmse=sqrt(rssPred/n),
.groups="drop"
) %>%
ggplot(aes(x=fct_city, y=fct_mon)) +
geom_tile(aes(fill=r2)) +
geom_text(aes(label=round(r2, 3)), size=3) +
scale_fill_continuous("R2", low="white", high="lightgreen", limits=c(0, 1)) +
labs(x=NULL, y=NULL, title="R2 for dewpoint predictions")
Actual daily dewpoints are calculated using hourly data:
dewPDaily <- allCityHourly %>%
mutate(fct_city=factor(cityName),
fct_mon=factor(month.abb[month(date)], levels=month.abb)
) %>%
group_by(fct_city, fct_mon, date) %>%
summarize(across(dewpoint_2m, .fns=list(mean=mean, min=min, max=max)), .groups="drop")
dewPDaily
## # A tibble: 35,592 × 6
## fct_city fct_mon date dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
## <fct> <fct> <date> <dbl> <dbl> <dbl>
## 1 Boston MA Jan 2010-01-01 -3.48 -5.1 -1.5
## 2 Boston MA Jan 2010-01-02 -4.96 -10 -0.6
## 3 Boston MA Jan 2010-01-03 -9.23 -12.8 -4.7
## 4 Boston MA Jan 2010-01-04 -6.86 -8.6 -4.7
## 5 Boston MA Jan 2010-01-05 -8.16 -9.4 -6.5
## 6 Boston MA Jan 2010-01-06 -8.53 -9.3 -7.4
## 7 Boston MA Jan 2010-01-07 -7.86 -9.2 -5.5
## 8 Boston MA Jan 2010-01-08 -8.65 -10.9 -7.5
## 9 Boston MA Jan 2010-01-09 -13.5 -17 -11.1
## 10 Boston MA Jan 2010-01-10 -16.4 -18 -14.4
## # ℹ 35,582 more rows
Annual average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date)) %>%
filter(year>=2010, year<2023) %>%
group_by(year, fct_city) %>%
summarize(mudp=mean(dewpoint_2m_mean), .groups="drop") %>%
ggplot(aes(x=factor(year), y=mudp)) +
geom_line(aes(group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and year", y="Average dewpoint (C)", x=NULL) +
scale_color_discrete(NULL)
Monthly average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date)) %>%
filter(year>=2010, year<2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(mudp=mean(dewpoint_2m_mean),
maxdp=max(dewpoint_2m_mean),
mindp=min(dewpoint_2m_mean),
.groups="drop"
) %>%
ggplot(aes(x=fct_mon)) +
geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and month", y="Average dewpoint (C)", x=NULL) +
scale_color_discrete(NULL)
Rolling 21-day average dewpoints by city are explored:
dewPDaily %>%
mutate(year=year(date), doy=pmin(365, yday(date))) %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
ungroup() %>%
filter(year>=2010, year<2023, !is.na(dewpoint_r21)) %>%
group_by(fct_city, doy) %>%
summarize(mudp=mean(dewpoint_r21),
maxdp=max(dewpoint_r21),
mindp=min(dewpoint_r21),
.groups="drop"
) %>%
ggplot(aes(x=doy)) +
geom_line(aes(y=mudp, group=fct_city, color=fct_city), lwd=1) +
labs(title="Average dewpoint by city and day of year",
subtitle="Rolling 21-day",
y="Average rolling 21-day dewpoint (C)",
x="Day of year"
) +
scale_color_discrete(NULL)
Dew points are added to the daily data:
allCityDaily_v2 <-
allCityDaily %>%
left_join(dewPDaily %>%
mutate(cityName=as.character(fct_city)) %>%
select(cityName, date, starts_with("dewp")),
by=c("cityName", "date")
)
allCityDaily_v2 %>%
group_by(cityName) %>%
summarize(n=n(), hasDew=sum(!is.na(dewpoint_2m_mean)))
## # A tibble: 7 × 3
## cityName n hasDew
## <chr> <int> <int>
## 1 Boston MA 5113 5113
## 2 Chicago IL 23376 5113
## 3 Houston TX 5113 5113
## 4 Las Vegas NV 5113 5113
## 5 Los Angeles CA 5113 5113
## 6 Miami FL 5113 5113
## 7 New York NY 4914 4914
Relationships among temperature, dewpoint, and apparent temperature are explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muTemp, muApp) %>%
ggplot(aes(x=muTemp, y=muApp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July apparent temperature vs. temperature",
x="Average daily temperature (C)",
y="Average daily\napparent temperature (C)"
) +
scale_size_continuous(NULL)
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muDew, muApp) %>%
ggplot(aes(x=muDew, y=muApp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July apparent temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\napparent temperature (C)"
) +
scale_size_continuous(NULL)
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round)) %>%
count(cityName, muDew, muTemp) %>%
ggplot(aes(x=muDew, y=muTemp)) +
geom_point(aes(size=n), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\ntemperature (C)"
) +
scale_size_continuous(NULL)
Relationships among temperature, dewpoint, and apparent temperature are further explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
mutate(across(where(is.numeric), .fns=round),
appr5=autoRound(muApp, rndTo=5L)
) %>%
count(cityName, muDew, muTemp, appr5) %>%
ggplot(aes(x=muDew, y=muTemp)) +
geom_point(aes(size=n, color=factor(appr5)), alpha=0.5)+
facet_wrap(~cityName) +
labs(title="July temperature vs. dewpoint",
x="Average daily dewpoint (C)",
y="Average daily\ntemperature (C)"
) +
scale_size_continuous(NULL) +
scale_color_discrete("App Temp\n(nearest 5 C)")
A simple linear regression is explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1693 -1.3825 -0.1217 1.9902 4.6627
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.40918 0.29867 4.718 2.5e-06 ***
## muTemp 1.02750 0.01116 92.101 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.362 on 2819 degrees of freedom
## Multiple R-squared: 0.7506, Adjusted R-squared: 0.7505
## F-statistic: 8483 on 1 and 2819 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.865 -3.368 0.385 3.460 12.594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.10439 0.23536 102.42 <2e-16 ***
## muDew 0.26039 0.01273 20.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.413 on 2819 degrees of freedom
## Multiple R-squared: 0.1293, Adjusted R-squared: 0.129
## F-statistic: 418.7 on 1 and 2819 DF, p-value: < 2.2e-16
On average, temperature is better than dewpoint as a predictor of apparent temperature. Interaction is also explored:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5790 -0.5599 0.0916 0.6567 3.2703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.110891 0.126564 -48.28 <2e-16 ***
## muTemp 1.091163 0.004236 257.58 <2e-16 ***
## muDew 0.337196 0.002586 130.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.891 on 2818 degrees of freedom
## Multiple R-squared: 0.9645, Adjusted R-squared: 0.9645
## F-statistic: 3.831e+04 on 2 and 2818 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew + muTemp:muDew, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew + muTemp:muDew, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5912 -0.5465 0.0883 0.6354 3.2415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.0232152 0.3884503 -10.357 < 2e-16 ***
## muTemp 1.0234331 0.0126449 80.936 < 2e-16 ***
## muDew 0.1842179 0.0270512 6.810 1.19e-11 ***
## muTemp:muDew 0.0051692 0.0009099 5.681 1.48e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8861 on 2817 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.9649
## F-statistic: 2.583e+04 on 3 and 2817 DF, p-value: < 2.2e-16
The inclusion of both temperature and dewpoint meaningfully decreases residual standard error. The interaction term between temperature and dewpoint, while statistically significant, has little practical impact. City is added as an additional explanatory variable:
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp + muDew + cityName + 0, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp + muDew + cityName + 0, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.90625 -0.42940 0.06619 0.46037 2.76072
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## muTemp 1.035693 0.005850 177.03 <2e-16 ***
## muDew 0.301535 0.004165 72.39 <2e-16 ***
## cityNameBoston MA -4.705775 0.145150 -32.42 <2e-16 ***
## cityNameChicago IL -4.732561 0.146445 -32.32 <2e-16 ***
## cityNameHouston TX -3.211754 0.177352 -18.11 <2e-16 ***
## cityNameLas Vegas NV -4.206984 0.188979 -22.26 <2e-16 ***
## cityNameLos Angeles CA -3.749403 0.140676 -26.65 <2e-16 ***
## cityNameMiami FL -3.267359 0.174773 -18.70 <2e-16 ***
## cityNameNew York NY -4.303946 0.154345 -27.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7383 on 2812 degrees of freedom
## Multiple R-squared: 0.9994, Adjusted R-squared: 0.9994
## F-statistic: 4.832e+05 on 9 and 2812 DF, p-value: < 2.2e-16
allCityDaily_v2 %>%
mutate(muTemp=(temperature_2m_max + temperature_2m_min)/2,
muApp=(apparent_temperature_max + apparent_temperature_min)/2,
muDew=dewpoint_2m_mean
) %>%
filter(month(date)==7, year(date)>=2010, year(date)<2023) %>%
select(cityName, muTemp, muDew, muApp) %>%
lm(muApp ~ muTemp:cityName + muDew:cityName + cityName + 0, data=.) %>%
summary()
##
## Call:
## lm(formula = muApp ~ muTemp:cityName + muDew:cityName + cityName +
## 0, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.08814 -0.39255 0.06184 0.42885 1.90562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## cityNameBoston MA -4.607592 0.285746 -16.125 < 2e-16 ***
## cityNameChicago IL -6.294447 0.250438 -25.134 < 2e-16 ***
## cityNameHouston TX -6.352102 0.886444 -7.166 9.84e-13 ***
## cityNameLas Vegas NV -0.360121 0.447277 -0.805 0.421
## cityNameLos Angeles CA -2.990764 0.372639 -8.026 1.47e-15 ***
## cityNameMiami FL -16.050511 1.563984 -10.263 < 2e-16 ***
## cityNameNew York NY -6.136167 0.349298 -17.567 < 2e-16 ***
## muTemp:cityNameBoston MA 0.984007 0.013408 73.392 < 2e-16 ***
## muTemp:cityNameChicago IL 1.063621 0.016338 65.100 < 2e-16 ***
## muTemp:cityNameHouston TX 1.008152 0.022276 45.258 < 2e-16 ***
## muTemp:cityNameLas Vegas NV 0.929141 0.013618 68.228 < 2e-16 ***
## muTemp:cityNameLos Angeles CA 0.993172 0.012892 77.041 < 2e-16 ***
## muTemp:cityNameMiami FL 1.119919 0.036174 30.960 < 2e-16 ***
## muTemp:cityNameNew York NY 1.043039 0.015807 65.984 < 2e-16 ***
## cityNameBoston MA:muDew 0.365493 0.012838 28.470 < 2e-16 ***
## cityNameChicago IL:muDew 0.350936 0.016710 21.001 < 2e-16 ***
## cityNameHouston TX:muDew 0.469955 0.023034 20.403 < 2e-16 ***
## cityNameLas Vegas NV:muDew 0.239456 0.005067 47.254 < 2e-16 ***
## cityNameLos Angeles CA:muDew 0.319501 0.014467 22.085 < 2e-16 ***
## cityNameMiami FL:muDew 0.741580 0.050220 14.767 < 2e-16 ***
## cityNameNew York NY:muDew 0.389827 0.012843 30.352 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6826 on 2800 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 2.423e+05 on 21 and 2800 DF, p-value: < 2.2e-16
Including city reduces residual error slightly
Dewpoints by month are further explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0)
) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
pivot_longer(cols=-c(fct_city, fct_mon)) %>%
ggplot(aes(x=fct_mon, y=value)) +
geom_line(aes(group=name, color=c("gte68"=">=68F", "lte32"="<=32F")[name])) +
facet_wrap(~fct_city) +
labs(x=NULL, y="Proportion of days", title="Proportion of days by month by dewpoint range") +
scale_color_discrete("Dewpoint")
Dewpoints at or above 20C (68F) by year are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
arrange(fct_city, date) %>%
group_by(fct_city, year) %>%
mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
ungroup() %>%
ggplot(aes(x=doy, y=gte68)) +
geom_line(aes(group=factor(year), color=factor(year))) +
facet_wrap(~fct_city, scales="free_y") +
labs(x=NULL,
y="Days with dewpoint >= 20C (68F)",
title="Cumulative days by year with dewpoint >= 20C (68F)"
) +
scale_color_discrete(NULL)
Dewpoints at or below 0C (32F) by year are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
arrange(fct_city, date) %>%
group_by(fct_city, year) %>%
mutate(gte68=cumsum(gte68), lte32=cumsum(lte32)) %>%
ungroup() %>%
ggplot(aes(x=doy, y=lte32)) +
geom_line(aes(group=factor(year), color=factor(year))) +
facet_wrap(~fct_city, scales="free_y") +
labs(x=NULL,
y="Days with dewpoint <= 0C (32F)",
title="Cumulative days by year with dewpoint <= 0C (32F)"
) +
scale_color_discrete(NULL)
Dewpoints at or above 20C (68F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=gte68)) +
geom_col(fill="lightblue") +
geom_text(aes(label=gte68), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="# Days (2010-2022)",
title="Days with dewpoint >= 20C (68F)\n2010-2022"
) +
scale_color_discrete(NULL)
Dewpoints at or below 0C (32F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=sum(gte68), lte32=sum(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=lte32)) +
geom_col(fill="lightblue") +
geom_text(aes(label=lte32), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="# Days (2010-2022)",
title="Days with dewpoint <= 0C (32F)\n2010-2022"
) +
scale_color_discrete(NULL)
Proportion of days with dewpoints at or above 20C (68F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=gte68)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(gte68,2)), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Proportion of Days (2010-2022)",
title="Days with dewpoint >= 20C (68F)\n2010-2022"
) +
scale_color_discrete(NULL)
Proportion of days with dewpoints at or below 0C (32F) by month are explored:
dewPDaily %>%
mutate(gte68=(dewpoint_2m_mean>=20),
lte32=(dewpoint_2m_mean<=0),
year=year(date),
doy=yday(date)
) %>%
filter(year < 2023) %>%
group_by(fct_city, fct_mon) %>%
summarize(gte68=mean(gte68), lte32=mean(lte32), .groups="drop") %>%
ggplot(aes(x=fct_mon, y=lte32)) +
geom_col(fill="lightblue") +
geom_text(aes(label=round(lte32,2)), size=2.5, vjust=0) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Proportion of Days (2010-2022)",
title="Days with dewpoint <= 0C (32F)\n2010-2022"
) +
scale_color_discrete(NULL)
ACF is calculated for dew point for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
acf(lag.max=1000, main="ACF for daily dewpoint - Chicago IL (2010-2022)")
ACF is calculated for dew point for each city:
dewCities <- dewPDaily %>%
pull(fct_city) %>%
as.vector() %>%
unique()
dewACF <- lapply(dewCities,
FUN=function(x) {
dewPDaily %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
acf(lag.max=2000, plot=FALSE)
}
)
names(dewACF) <- dewCities
ACF are combined into a single tibble:
dewACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(dewACF[[x]][["lag"]]),
acf=as.vector(dewACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
dewACFAll
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.882 Boston MA
## 3 1 2 0.778 Boston MA
## 4 1 3 0.754 Boston MA
## 5 1 4 0.750 Boston MA
## 6 1 5 0.749 Boston MA
## 7 1 6 0.750 Boston MA
## 8 1 7 0.749 Boston MA
## 9 1 8 0.741 Boston MA
## 10 1 9 0.731 Boston MA
## # ℹ 13,997 more rows
dewACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Chicago IL 0.388
## 2 Boston MA 0.377
## 3 New York NY 0.372
## 4 Houston TX 0.280
## 5 Miami FL 0.250
## 6 Los Angeles CA 0.222
## 7 Las Vegas NV 0.138
ACF by city are plotted:
dewACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of dewpoint by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
PACF is calculated for dew point for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=1000, main="PACF for daily dewpoint - Chicago IL (2010-2022)")
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=20, main="PACF for daily dewpoint - Chicago IL (2010-2022)")
PACF is calculated for dew point for each city:
dewPACF <- lapply(dewCities,
FUN=function(x) {
dewPDaily %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(dewpoint_2m_mean) %>%
pacf(lag.max=2000, plot=FALSE)
}
)
names(dewPACF) <- dewCities
PACF are combined into a single tibble:
dewPACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(dewPACF[[x]][["lag"]]),
pacf=as.vector(dewPACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
dewPACFAll
## # A tibble: 14,000 × 4
## src lag pacf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.882 Boston MA
## 2 1 2 -0.00177 Boston MA
## 3 1 3 0.306 Boston MA
## 4 1 4 0.127 Boston MA
## 5 1 5 0.158 Boston MA
## 6 1 6 0.126 Boston MA
## 7 1 7 0.103 Boston MA
## 8 1 8 0.0677 Boston MA
## 9 1 9 0.0515 Boston MA
## 10 1 10 0.0707 Boston MA
## # ℹ 13,990 more rows
dewPACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_pacf=mean(abs(pacf))) %>%
arrange(desc(mu_abs_pacf))
## # A tibble: 7 × 2
## cityName mu_abs_pacf
## <chr> <dbl>
## 1 Boston MA 0.0114
## 2 New York NY 0.0113
## 3 Chicago IL 0.0112
## 4 Houston TX 0.0110
## 5 Los Angeles CA 0.0109
## 6 Miami FL 0.0108
## 7 Las Vegas NV 0.0104
PACF by city are plotted:
dewPACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=pacf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of dewpoint by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1)) +
scale_x_log10()
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 19.8 0.0629 20.3 4.50
## 2 Los Angeles CA 6.85 0.0953 46.4 6.81
## 3 Las Vegas NV -1.59 0.102 53.0 7.28
## 4 Houston TX 15.8 0.109 61.1 7.82
## 5 New York NY 6.56 0.144 102. 10.1
## 6 Boston MA 5.47 0.143 104. 10.2
## 7 Chicago IL 5.43 0.150 115. 10.7
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.85 2.20 0.872 19.8
## 2 Los Angeles CA 10.9 3.30 0.875 6.85
## 3 Las Vegas NV 12.8 3.58 0.871 -1.59
## 4 Chicago IL 16.8 4.10 0.924 5.42
## 5 Houston TX 17.7 4.20 0.843 15.8
## 6 New York NY 21.1 4.60 0.891 6.56
## 7 Boston MA 23.0 4.80 0.882 5.47
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 8.38 2.90 0.796 19.8
## 2 Los Angeles CA 19.2 4.38 0.794 6.85
## 3 Las Vegas NV 22.5 4.75 0.776 -1.59
## 4 Houston TX 26.1 5.11 0.786 15.8
## 5 New York NY 42.9 6.55 0.782 6.56
## 6 Chicago IL 43.6 6.60 0.821 5.42
## 7 Boston MA 44.4 6.66 0.776 5.47
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.58 2.14 0.799 0.315 19.8
## 2 Los Angeles CA 10.3 3.21 0.804 0.307 6.85
## 3 Las Vegas NV 12.2 3.50 0.807 0.269 -1.59
## 4 Houston TX 16.3 4.04 0.733 0.394 15.8
## 5 Chicago IL 16.6 4.08 0.902 0.151 5.43
## 6 New York NY 21.1 4.60 0.878 0.0594 6.56
## 7 Boston MA 23.0 4.80 0.881 0.00734 5.47
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.68 2.16 1.04 -0.187 19.8
## 2 Los Angeles CA 10.5 3.24 1.03 -0.181 6.85
## 3 Las Vegas NV 12.4 3.52 1.03 -0.183 -1.59
## 4 Chicago IL 16.7 4.09 0.989 -0.0703 5.43
## 5 Houston TX 17.0 4.12 1.00 -0.191 15.8
## 6 New York NY 21.1 4.60 0.908 -0.0194 6.56
## 7 Boston MA 23.0 4.80 0.884 -0.00206 5.47
The auto.arima() process is run to estimate optimal parameters for a single city:
dewPDaily %>%
filter(fct_city=="Chicago IL") %>%
arrange(date) %>%
pull(dewpoint_2m_mean) %>%
forecast::auto.arima()
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Series: .
## ARIMA(5,0,0) with non-zero mean
##
## Coefficients:
## ar1 ar2 ar3 ar4 ar5 mean
## 0.9676 -0.3158 0.1975 -0.0276 0.1317 5.4251
## s.e. 0.0139 0.0194 0.0197 0.0194 0.0139 1.1627
##
## sigma^2 = 15.17: log likelihood = -14205.69
## AIC=28425.37 AICc=28425.39 BIC=28471.15
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(dewpoint_2m_mean) %>%
forecast::auto.arima()
)
names(tstARIMA) <- dewCities
The optimal model by city is pulled:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 5 0 0 20.01
## Chicago IL 5 0 0 15.17
## Houston TX 5 0 0 15.04
## Las Vegas NV 2 3 0 11.83
## Los Angeles CA 2 2 0 9.71
## Miami FL 2 4 1 4.26
## New York NY 5 0 0 18.56
Sigma-squared is also explored:
cat("\n\nOptimal Parameters (ARIMA for daily mean dewpoint by city):\n")
##
##
## Optimal Parameters (ARIMA for daily mean dewpoint by city):
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
full_join(dewPDaily %>% group_by(fct_city) %>% summarize(varOrig=var(dewpoint_2m_mean)),
by=c("cityName"="fct_city")
) %>%
mutate(pct=1-sig2/varOrig)
## cityName ar ma d sig2 varOrig pct
## 1 Boston MA 5 0 0 20.01 103.86622 0.8073483
## 2 Chicago IL 5 0 0 15.17 115.11075 0.8682139
## 3 Houston TX 5 0 0 15.04 61.13661 0.7539935
## 4 Las Vegas NV 2 3 0 11.83 53.01145 0.7768406
## 5 Los Angeles CA 2 2 0 9.71 46.41365 0.7907943
## 6 Miami FL 2 4 1 4.26 20.26281 0.7897626
## 7 New York NY 5 0 0 18.56 102.44193 0.8188242
Rolling 21-day dewpoints by city are plotted:
dewPDaily %>%
mutate(doy=yday(date)) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop") %>%
ggplot(aes(x=doy)) +
geom_line(aes(y=dewpoint_r21), lwd=2) +
geom_point(data=dewPDaily %>% mutate(doy=yday(date), year=year(date)) %>% arrange(date),
aes(y=dewpoint_2m_mean, group=year),
alpha=0.25,
size=0.1
) +
facet_wrap(~fct_city) +
labs(x="Day of year",
y="Dewpoint (C)",
title="Dewpoint by day of year\n(actual and rolling 21-day average)"
)
Daily dewpoint data are converted to deviation from long-term rolling 21-day mean by day of year:
dewPDaily_vs_r21 <- dewPDaily %>%
mutate(doy=pmin(yday(date), 365), year=year(date)) %>%
left_join(dewPDaily %>%
mutate(doy=pmin(yday(date),365)) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="dewpoint_2m_mean", newName="dewpoint_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(dewpoint_r21=mean(dewpoint_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=dewpoint_2m_mean-dewpoint_r21)
dewPDaily_vs_r21
## # A tibble: 35,592 × 10
## fct_city fct_mon date dewpoint_2m_mean dewpoint_2m_min dewpoint_2m_max
## <fct> <fct> <date> <dbl> <dbl> <dbl>
## 1 Boston MA Jan 2010-01-01 -3.48 -5.1 -1.5
## 2 Boston MA Jan 2010-01-02 -4.96 -10 -0.6
## 3 Boston MA Jan 2010-01-03 -9.23 -12.8 -4.7
## 4 Boston MA Jan 2010-01-04 -6.86 -8.6 -4.7
## 5 Boston MA Jan 2010-01-05 -8.16 -9.4 -6.5
## 6 Boston MA Jan 2010-01-06 -8.53 -9.3 -7.4
## 7 Boston MA Jan 2010-01-07 -7.86 -9.2 -5.5
## 8 Boston MA Jan 2010-01-08 -8.65 -10.9 -7.5
## 9 Boston MA Jan 2010-01-09 -13.5 -17 -11.1
## 10 Boston MA Jan 2010-01-10 -16.4 -18 -14.4
## # ℹ 35,582 more rows
## # ℹ 4 more variables: doy <dbl>, year <dbl>, dewpoint_r21 <dbl>,
## # delta_r21 <dbl>
summary(dewPDaily_vs_r21)
## fct_city fct_mon date dewpoint_2m_mean
## Boston MA :5113 Jan : 3038 Min. :2010-01-01 Min. :-35.76667
## Chicago IL :5113 Mar : 3038 1st Qu.:2013-06-25 1st Qu.: -0.01771
## Houston TX :5113 May : 3038 Median :2016-12-17 Median : 9.47500
## Las Vegas NV :5113 Jul : 3007 Mean :2016-12-17 Mean : 8.34512
## Los Angeles CA:5113 Aug : 3007 3rd Qu.:2020-06-10 3rd Qu.: 17.53333
## Miami FL :5113 Oct : 3007 Max. :2023-12-31 Max. : 25.94167
## New York NY :4914 (Other):17457
## dewpoint_2m_min dewpoint_2m_max doy year
## Min. :-38.600 Min. :-32.00 Min. : 1.0 Min. :2010
## 1st Qu.: -3.500 1st Qu.: 3.80 1st Qu.: 91.0 1st Qu.:2013
## Median : 6.300 Median : 12.50 Median :182.0 Median :2016
## Mean : 5.499 Mean : 11.28 Mean :182.6 Mean :2016
## 3rd Qu.: 15.200 3rd Qu.: 19.90 3rd Qu.:274.0 3rd Qu.:2020
## Max. : 25.200 Max. : 27.60 Max. :365.0 Max. :2023
##
## dewpoint_r21 delta_r21
## Min. :-8.98471 Min. :-28.33807
## 1st Qu.:-0.04931 1st Qu.: -2.77771
## Median : 8.78258 Median : 0.33513
## Mean : 8.34958 Mean : -0.00446
## 3rd Qu.:16.58411 3rd Qu.: 3.06098
## Max. :23.96929 Max. : 19.74022
##
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL -0.0220 0.0441 9.95 3.15
## 2 Los Angeles CA 0.0118 0.0701 25.1 5.01
## 3 Houston TX -0.0141 0.0729 27.2 5.21
## 4 Boston MA 0.00479 0.0733 27.5 5.24
## 5 New York NY -0.0177 0.0750 27.6 5.26
## 6 Chicago IL 0.00493 0.0740 28.0 5.29
## 7 Las Vegas NV 0.000614 0.0816 34.1 5.84
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.52 2.13 0.739 -0.0221
## 2 Los Angeles CA 10.3 3.21 0.768 0.0118
## 3 Las Vegas NV 12.3 3.51 0.799 0.000632
## 4 Chicago IL 14.7 3.84 0.688 0.00495
## 5 Houston TX 15.8 3.97 0.648 -0.0141
## 6 New York NY 17.8 4.22 0.596 -0.0177
## 7 Boston MA 19.0 4.36 0.556 0.00478
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 5.08 2.25 0.724 -0.0219
## 2 Los Angeles CA 12.3 3.51 0.734 0.0117
## 3 Houston TX 15.1 3.88 0.706 -0.0141
## 4 Chicago IL 15.8 3.98 0.678 0.00488
## 5 Las Vegas NV 16.4 4.05 0.731 0.000566
## 6 New York NY 17.7 4.21 0.622 -0.0176
## 7 Boston MA 18.4 4.29 0.606 0.00477
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.11 2.03 0.572 0.406 -0.0222
## 2 Los Angeles CA 9.45 3.07 0.622 0.385 0.0118
## 3 Las Vegas NV 11.6 3.40 0.691 0.318 0.000754
## 4 Houston TX 13.7 3.70 0.397 0.512 -0.0141
## 5 Chicago IL 13.8 3.72 0.503 0.371 0.00503
## 6 New York NY 16.8 4.10 0.353 0.397 -0.0178
## 7 Boston MA 17.9 4.23 0.276 0.429 0.00474
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 4.17 2.04 0.945 -0.279 -0.0219
## 2 Los Angeles CA 9.62 3.10 0.963 -0.253 0.0118
## 3 Las Vegas NV 11.7 3.41 0.982 -0.229 0.000716
## 4 Chicago IL 14.0 3.75 0.839 -0.219 0.00523
## 5 Houston TX 14.0 3.75 0.863 -0.332 -0.0139
## 6 New York NY 17.1 4.13 0.719 -0.207 -0.0176
## 7 Boston MA 18.1 4.26 0.673 -0.211 0.00475
Sigma-squared for a single city is compared across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
sapply(pdqList, FUN=function(pdq) lapply("Boston MA", FUN=function(x) { tmp<- dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq)); tmp$sigma2}
)
) %>%
as_tibble() %>%
t()
## [,1]
## c(0, 0, 0) 27.48617
## c(1, 0, 0) 18.98769
## c(0, 0, 1) 18.41728
## c(1, 0, 1) 17.87572
## c(2, 0, 0) 18.14050
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 27.5 28.0 27.2 34.0 25.1
## 2 c(1, 0,… 19.0 14.7 15.8 12.3 10.3
## 3 c(0, 0,… 18.4 15.8 15.1 16.4 12.3
## 4 c(1, 0,… 17.9 13.8 13.7 11.6 9.45
## 5 c(2, 0,… 18.1 14.0 14.0 11.7 9.62
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) dewPDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 1 3 0 17.82
## Chicago IL 3 2 0 13.79
## Houston TX 1 3 0 13.65
## Las Vegas NV 1 1 0 11.55
## Los Angeles CA 2 2 0 9.42
## Miami FL 5 5 0 4.10
## New York NY 2 3 0 16.72
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(dewPDaily %>%
group_by(fct_city) %>%
summarize(value=var(dewpoint_2m_mean)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily dewpoint vs. long-term 21-day mean for day of year)",
caption="base=daily dewpoint (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Lag/lead for dewpoint vs. 21-day mean is plotted:
dewPDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(dewNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), dewNext=round(dewNext)) %>%
ggplot(aes(x=delta_r21, y=dewNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Dewpoint vs. long-term 21-day mean",
y="Next day's dewpoint vs. long-term 21-day mean",
title="Dewpoint vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
Daily mean temperature data are converted to deviation from long-term rolling 21-day mean by day of year:
tempDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
meanTemp=0.5*(temperature_2m_max+temperature_2m_min),
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, meanTemp) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
meanTemp=0.5*(temperature_2m_max+temperature_2m_min),
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="meanTemp", newName="temp_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(temp_r21=mean(temp_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=meanTemp-temp_r21)
tempDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year meanTemp temp_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 0.5 0.112 0.388
## 2 Boston MA 2010-01-02 2 2010 -2.55 0.0786 -2.63
## 3 Boston MA 2010-01-03 3 2010 -5.15 0.00934 -5.16
## 4 Boston MA 2010-01-04 4 2010 -2.5 -0.152 -2.35
## 5 Boston MA 2010-01-05 5 2010 -3.15 -0.338 -2.81
## 6 Boston MA 2010-01-06 6 2010 -3 -0.396 -2.60
## 7 Boston MA 2010-01-07 7 2010 -1.65 -0.486 -1.16
## 8 Boston MA 2010-01-08 8 2010 -4.6 -0.552 -4.05
## 9 Boston MA 2010-01-09 9 2010 -8.5 -0.567 -7.93
## 10 Boston MA 2010-01-10 10 2010 -9.05 -0.608 -8.44
## # ℹ 53,845 more rows
summary(tempDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## meanTemp temp_r21 delta_r21
## Min. :-29.00 Min. :-5.275 Min. :-23.877046
## 1st Qu.: 6.75 1st Qu.: 6.956 1st Qu.: -2.371354
## Median : 16.40 Median :16.192 Median : 0.025680
## Mean : 14.53 Mean :14.532 Mean : -0.000837
## 3rd Qu.: 23.30 3rd Qu.:22.876 3rd Qu.: 2.402418
## Max. : 39.75 Max. :32.763 Max. : 18.633780
##
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL -0.0182 0.0293 4.40 2.10
## 2 Los Angeles CA 0.00593 0.0453 10.5 3.24
## 3 Las Vegas NV 0.00745 0.0473 11.4 3.38
## 4 Houston TX -0.0135 0.0517 13.7 3.69
## 5 New York NY -0.00916 0.0540 14.3 3.79
## 6 Boston MA 0.00000680 0.0556 15.8 3.97
## 7 Chicago IL 0.00400 0.0303 21.5 4.64
Lag/lead for temperature vs. 21-day mean is plotted:
tempDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(tempNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), tempNext=round(tempNext)) %>%
ggplot(aes(x=delta_r21, y=tempNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Temperature vs. long-term 21-day mean",
y="Next day's temperature vs. long-term 21-day mean",
title="Temperature vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.12 1.46 0.721 -0.0183
## 2 Los Angeles CA 3.31 1.82 0.827 0.00592
## 3 Las Vegas NV 4.31 2.08 0.789 0.00746
## 4 Houston TX 6.44 2.54 0.727 -0.0134
## 5 New York NY 7.88 2.81 0.671 -0.00917
## 6 Boston MA 9.17 3.03 0.648 0.00000623
## 7 Chicago IL 9.74 3.12 0.740 0.00400
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.38 1.54 0.696 -0.0181
## 2 Los Angeles CA 4.70 2.17 0.758 0.00586
## 3 Las Vegas NV 5.66 2.38 0.718 0.00737
## 4 Houston TX 6.95 2.64 0.725 -0.0135
## 5 New York NY 8.61 2.93 0.636 -0.00914
## 6 Boston MA 9.58 3.10 0.638 0.0000114
## 7 Chicago IL 11.3 3.36 0.708 0.00400
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 1.99 1.41 0.559 0.356 -0.0183
## 2 Los Angeles CA 2.96 1.72 0.728 0.366 0.00589
## 3 Las Vegas NV 4.06 2.01 0.685 0.298 0.00754
## 4 Houston TX 5.70 2.39 0.542 0.451 -0.0134
## 5 New York NY 7.61 2.76 0.511 0.297 -0.00923
## 6 Boston MA 8.70 2.95 0.449 0.358 -0.0000145
## 7 Chicago IL 9.21 3.03 0.589 0.346 0.00401
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami FL 2.02 1.42 0.877 -0.217 -0.0181
## 2 Los Angeles CA 2.95 1.72 1.10 -0.330 0.00578
## 3 Las Vegas NV 4.06 2.02 0.978 -0.239 0.00747
## 4 Houston TX 5.82 2.41 0.952 -0.310 -0.0133
## 5 New York NY 7.68 2.77 0.779 -0.162 -0.00904
## 6 Boston MA 8.82 2.97 0.774 -0.195 0.0000453
## 7 Chicago IL 9.35 3.06 0.888 -0.200 0.00400
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 15.8 21.5 13.6 11.4 10.5
## 2 c(1, 0,… 9.16 9.74 6.44 4.31 3.31
## 3 c(0, 0,… 9.58 11.3 6.95 5.66 4.70
## 4 c(1, 0,… 8.70 9.21 5.69 4.06 2.96
## 5 c(2, 0,… 8.82 9.35 5.82 4.06 2.95
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) tempDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 3 3 0 8.63
## Chicago IL 2 4 0 9.12
## Houston TX 4 1 0 5.63
## Las Vegas NV 1 4 0 4.05
## Los Angeles CA 1 5 0 2.94
## Miami FL 2 3 0 1.97
## New York NY 2 3 0 7.51
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(tempDaily_vs_r21 %>%
group_by(fct_city) %>%
summarize(value=var(meanTemp)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily temperature vs. long-term 21-day mean for day of year)",
caption="base=daily temperature (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Daily maximum wind speed data are converted to deviation from long-term rolling 21-day mean by day of year:
windDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
maxWind=windspeed_10m_max,
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, maxWind) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
maxWind=windspeed_10m_max,
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="maxWind", newName="wind_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(wind_r21=mean(wind_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=maxWind-wind_r21)
windDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year maxWind wind_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 13.1 22.5 -9.35
## 2 Boston MA 2010-01-02 2 2010 34.7 22.6 12.1
## 3 Boston MA 2010-01-03 3 2010 34.8 22.7 12.1
## 4 Boston MA 2010-01-04 4 2010 28.1 22.6 5.50
## 5 Boston MA 2010-01-05 5 2010 17.1 22.5 -5.40
## 6 Boston MA 2010-01-06 6 2010 20.7 22.4 -1.69
## 7 Boston MA 2010-01-07 7 2010 21.1 22.3 -1.18
## 8 Boston MA 2010-01-08 8 2010 21.5 22.2 -0.738
## 9 Boston MA 2010-01-09 9 2010 22 22.3 -0.274
## 10 Boston MA 2010-01-10 10 2010 20 22.4 -2.41
## # ℹ 53,845 more rows
summary(windDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## maxWind wind_r21 delta_r21
## Min. : 4.30 Min. :11.69 Min. :-18.59129
## 1st Qu.:14.30 1st Qu.:17.66 1st Qu.: -4.38622
## Median :18.90 Median :20.46 Median : -0.68503
## Mean :19.91 Mean :19.91 Mean : -0.00301
## 3rd Qu.:24.50 3rd Qu.:23.23 3rd Qu.: 3.60730
## Max. :97.00 Max. :25.18 Max. : 79.88061
##
Lag/lead for maximum wind speed vs. 21-day mean is plotted:
windDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(windNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), windNext=round(windNext)) %>%
ggplot(aes(x=delta_r21, y=windNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Maximum wind speed vs. long-term 21-day mean",
y="Next day's maximum wind speed vs. long-term 21-day mean",
title="Maximum wind speed vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA -0.0117 0.0555 15.8 3.97
## 2 New York NY 0.00404 0.0812 32.4 5.69
## 3 Miami FL 0.00896 0.0827 34.9 5.91
## 4 Houston TX -0.00981 0.0827 35.0 5.91
## 5 Boston MA -0.0122 0.0879 39.5 6.29
## 6 Chicago IL 0.000733 0.0454 48.2 6.95
## 7 Las Vegas NV -0.0141 0.111 62.7 7.92
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.468 -0.0117
## 2 Miami FL 24.9 4.99 0.536 0.00885
## 3 New York NY 28.7 5.36 0.336 0.00409
## 4 Houston TX 28.9 5.38 0.416 -0.00984
## 5 Boston MA 35.7 5.98 0.311 -0.0122
## 6 Chicago IL 41.7 6.46 0.368 0.000734
## 7 Las Vegas NV 48.6 6.97 0.474 -0.0141
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.8 3.57 0.419 -0.0118
## 2 Miami FL 25.9 5.09 0.499 0.00896
## 3 New York NY 28.2 5.31 0.382 0.00398
## 4 Houston TX 29.0 5.38 0.415 -0.00976
## 5 Boston MA 35.3 5.94 0.345 -0.0124
## 6 Chicago IL 41.4 6.44 0.383 0.000734
## 7 Las Vegas NV 50.4 7.10 0.426 -0.0142
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.476 -0.0104 -0.0116
## 2 Miami FL 24.7 4.97 0.418 0.166 0.00853
## 3 New York NY 28.2 5.31 0.0118 0.372 0.00451
## 4 Houston TX 28.7 5.35 0.234 0.222 -0.00996
## 5 Boston MA 35.3 5.94 0.0187 0.328 -0.0122
## 6 Chicago IL 41.3 6.43 0.144 0.261 0.000744
## 7 Las Vegas NV 48.6 6.97 0.463 0.0145 -0.0140
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Los Angeles CA 12.3 3.51 0.466 0.00332 -0.0117
## 2 Miami FL 24.7 4.97 0.583 -0.0893 0.00861
## 3 New York NY 28.3 5.32 0.378 -0.124 0.00395
## 4 Houston TX 28.7 5.36 0.452 -0.0869 -0.00971
## 5 Boston MA 35.3 5.94 0.344 -0.108 -0.0124
## 6 Chicago IL 41.4 6.43 0.401 -0.0896 0.000733
## 7 Las Vegas NV 48.6 6.97 0.476 -0.00481 -0.0140
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 39.5 48.2 35.0 62.6 15.7
## 2 c(1, 0,… 35.7 41.7 28.9 48.6 12.3
## 3 c(0, 0,… 35.3 41.4 29.0 50.4 12.8
## 4 c(1, 0,… 35.3 41.3 28.7 48.6 12.3
## 5 c(2, 0,… 35.3 41.4 28.7 48.6 12.3
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) windDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 2 2 0 35.31
## Chicago IL 1 1 0 41.32
## Houston TX 5 0 0 28.54
## Las Vegas NV 1 0 0 48.57
## Los Angeles CA 1 0 0 12.30
## Miami FL 1 1 0 24.73
## New York NY 4 2 0 28.20
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(windDaily_vs_r21 %>%
group_by(fct_city) %>%
summarize(value=var(maxWind)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily maximum wind speed vs. long-term 21-day mean for day of year)",
caption="base=daily max wind speed (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Daily precipitation data are converted to deviation from long-term rolling 21-day mean by day of year:
precipDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
sumPrecip=precipitation_sum,
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, sumPrecip) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
sumPrecip=precipitation_sum,
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="sumPrecip", newName="precip_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(precip_r21=mean(precip_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=sumPrecip-precip_r21)
precipDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year sumPrecip precip_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 0.5 3.17 -2.67
## 2 Boston MA 2010-01-02 2 2010 8.9 3.42 5.48
## 3 Boston MA 2010-01-03 3 2010 6.7 3.22 3.48
## 4 Boston MA 2010-01-04 4 2010 0 3.19 -3.19
## 5 Boston MA 2010-01-05 5 2010 0 2.91 -2.91
## 6 Boston MA 2010-01-06 6 2010 0 3.02 -3.02
## 7 Boston MA 2010-01-07 7 2010 0 2.89 -2.89
## 8 Boston MA 2010-01-08 8 2010 0.2 3.08 -2.88
## 9 Boston MA 2010-01-09 9 2010 0 2.79 -2.79
## 10 Boston MA 2010-01-10 10 2010 0 2.93 -2.93
## # ℹ 53,845 more rows
summary(precipDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## sumPrecip precip_r21 delta_r21
## Min. : 0.000 Min. :0.01599 Min. : -7.33469
## 1st Qu.: 0.000 1st Qu.:2.01905 1st Qu.: -2.95850
## Median : 0.000 Median :2.85204 Median : -1.93698
## Mean : 2.601 Mean :2.60069 Mean : -0.00017
## 3rd Qu.: 1.800 3rd Qu.:3.26094 3rd Qu.: -0.16224
## Max. :143.900 Max. :7.33469 Max. :138.16463
##
Lag/lead for precipitation vs. 21-day mean is plotted:
precipDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(precipNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), precipNext=round(precipNext)) %>%
ggplot(aes(x=delta_r21, y=precipNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Precipitation vs. long-term 21-day mean",
y="Next day's precipitation vs. long-term 21-day mean",
title="Precipitation vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV -0.00184 0.0263 3.53 1.88
## 2 Los Angeles CA 0.00116 0.0722 26.7 5.16
## 3 Chicago IL 0.000684 0.0398 37.1 6.09
## 4 New York NY -0.00681 0.101 49.9 7.06
## 5 Boston MA -0.000555 0.104 54.9 7.41
## 6 Miami FL 0.00843 0.108 60.1 7.75
## 7 Houston TX -0.00562 0.140 101. 10.0
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 3.36 1.83 0.224 -0.00184
## 2 Los Angeles CA 24.6 4.96 0.275 0.00118
## 3 Chicago IL 36.3 6.03 0.145 0.000685
## 4 New York NY 49.1 7.01 0.126 -0.00680
## 5 Miami FL 54.1 7.36 0.315 0.00842
## 6 Boston MA 54.3 7.37 0.102 -0.000531
## 7 Houston TX 89.3 9.45 0.336 -0.00560
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 3.37 1.84 0.205 -0.00185
## 2 Los Angeles CA 24.8 4.98 0.255 0.00113
## 3 Chicago IL 36.3 6.03 0.149 0.000684
## 4 New York NY 49.0 7.00 0.144 -0.00682
## 5 Boston MA 54.2 7.36 0.115 -0.000567
## 6 Miami FL 54.4 7.38 0.304 0.00843
## 7 Houston TX 90.6 9.52 0.300 -0.00565
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 3.36 1.83 0.315 -0.0952 -0.00173
## 2 Los Angeles CA 24.6 4.96 0.376 -0.110 0.00170
## 3 Chicago IL 36.3 6.03 -0.0159 0.164 0.000715
## 4 New York NY 48.9 6.99 -0.288 0.426 -0.00709
## 5 Boston MA 54.1 7.36 -0.343 0.454 -0.000277
## 6 Miami FL 54.1 7.36 0.261 0.0599 0.00833
## 7 Houston TX 89.2 9.45 0.398 -0.0702 -0.00529
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 3.36 1.83 0.219 0.0234 -0.00183
## 2 Los Angeles CA 24.6 4.96 0.270 0.0209 0.00122
## 3 Chicago IL 36.3 6.03 0.148 -0.0217 0.000685
## 4 New York NY 48.8 6.99 0.136 -0.0719 -0.00678
## 5 Boston MA 54.1 7.35 0.109 -0.0632 -0.000604
## 6 Miami FL 54.1 7.36 0.319 -0.0141 0.00845
## 7 Houston TX 89.2 9.45 0.328 0.0218 -0.00552
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 54.8 37.1 101. 3.53 26.6
## 2 c(1, 0,… 54.3 36.3 89.3 3.36 24.6
## 3 c(0, 0,… 54.2 36.3 90.6 3.37 24.8
## 4 c(1, 0,… 54.1 36.3 89.2 3.35 24.6
## 5 c(2, 0,… 54.1 36.3 89.2 3.35 24.6
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
pdqAll %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily precipitation vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) precipDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 0 2 0 54.08
## Chicago IL 0 1 0 36.32
## Houston TX 1 1 0 89.24
## Las Vegas NV 0 2 0 3.36
## Los Angeles CA 1 2 0 24.55
## Miami FL 1 2 0 54.07
## New York NY 1 2 0 48.84
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily precipitation vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(precipDaily_vs_r21 %>%
group_by(fct_city) %>%
summarize(value=var(sumPrecip)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily precipitation vs. long-term 21-day mean for day of year)",
caption="base=daily precipitation (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
Daily shortwave radiation data are converted to deviation from long-term rolling 21-day mean by day of year:
swrDaily_vs_r21 <- allCityDaily %>%
mutate(doy=pmin(yday(date), 365),
year=year(date),
sumSWR=shortwave_radiation_sum,
fct_city=factor(cityName)
) %>%
select(fct_city, date, doy, year, sumSWR) %>%
left_join(allCityDaily %>%
mutate(doy=pmin(yday(date),365),
sumSWR=shortwave_radiation_sum,
fct_city=factor(cityName)
) %>%
arrange(date) %>%
group_by(fct_city) %>%
helperRollingAgg(origVar="sumSWR", newName="swr_r21", k=21) %>%
group_by(fct_city, doy) %>%
summarize(swr_r21=mean(swr_r21, na.rm=TRUE), .groups="drop"),
by=c("fct_city", "doy")
) %>%
mutate(delta_r21=sumSWR-swr_r21)
swrDaily_vs_r21
## # A tibble: 53,855 × 7
## fct_city date doy year sumSWR swr_r21 delta_r21
## <fct> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Boston MA 2010-01-01 1 2010 6.58 6.01 0.565
## 2 Boston MA 2010-01-02 2 2010 3.86 6.03 -2.17
## 3 Boston MA 2010-01-03 3 2010 4.04 6.13 -2.09
## 4 Boston MA 2010-01-04 4 2010 7.66 6.12 1.54
## 5 Boston MA 2010-01-05 5 2010 7.55 6.18 1.37
## 6 Boston MA 2010-01-06 6 2010 8.26 6.18 2.08
## 7 Boston MA 2010-01-07 7 2010 8.82 6.29 2.53
## 8 Boston MA 2010-01-08 8 2010 6.22 6.31 -0.0854
## 9 Boston MA 2010-01-09 9 2010 8.85 6.39 2.46
## 10 Boston MA 2010-01-10 10 2010 9.24 6.45 2.79
## # ℹ 53,845 more rows
summary(swrDaily_vs_r21)
## fct_city date doy year
## Boston MA : 5113 Min. :1960-01-01 Min. : 1.0 Min. :1960
## Chicago IL :23376 1st Qu.:1996-11-10 1st Qu.: 91.0 1st Qu.:1996
## Houston TX : 5113 Median :2013-05-22 Median :183.0 Median :2013
## Las Vegas NV : 5113 Mean :2006-02-14 Mean :182.8 Mean :2006
## Los Angeles CA: 5113 3rd Qu.:2018-08-28 3rd Qu.:274.0 3rd Qu.:2018
## Miami FL : 5113 Max. :2023-12-31 Max. :365.0 Max. :2023
## New York NY : 4914
## sumSWR swr_r21 delta_r21
## Min. : 0.40 Min. : 5.561 Min. :-23.212381
## 1st Qu.: 9.99 1st Qu.:10.960 1st Qu.: -2.274057
## Median :16.20 Median :16.771 Median : 0.956395
## Mean :16.31 Mean :16.310 Mean : 0.000783
## 3rd Qu.:22.75 3rd Qu.:21.479 3rd Qu.: 2.923078
## Max. :33.04 Max. :30.328 Max. : 10.525340
##
Lag/lead for shortwave radiation vs. 21-day mean is plotted:
swrDaily_vs_r21 %>%
arrange(fct_city, date) %>%
group_by(fct_city) %>%
mutate(swrNext=lead(delta_r21)) %>%
ungroup() %>%
filter(complete.cases(.)) %>%
count(fct_city, delta_r21=round(delta_r21), swrNext=round(swrNext)) %>%
ggplot(aes(x=delta_r21, y=swrNext)) +
geom_point(aes(size=n), alpha=0.1) +
facet_wrap(~fct_city) +
geom_smooth(aes(weight=n), method="lm") +
annotate("point", x=0, y=0, color="red") +
geom_abline(intercept=0, slope=1, color="red", lty=2) +
labs(x="Shortwave radiation vs. long-term 21-day mean",
y="Next day's shortwave radiation vs. long-term 21-day mean",
title="Shortwave radiation vs. long-term 21-day mean by day of year and city"
) +
scale_size_continuous("# Obs")
## `geom_smooth()` using formula = 'y ~ x'
A baseline ARIMA model is explored, starting with c(0, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
intercept=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef),
se=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$var.coef %>% sqrt),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName intercept se sigma2 sigma
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 0.00123 0.0367 6.88 2.62
## 2 Los Angeles CA -0.00246 0.0466 11.1 3.33
## 3 Miami FL -0.00172 0.0509 13.2 3.64
## 4 Chicago IL -0.000810 0.0301 21.1 4.60
## 5 Houston TX 0.00746 0.0681 23.7 4.87
## 6 Boston MA 0.000330 0.0739 27.9 5.28
## 7 New York NY 0.00740 0.0772 29.3 5.41
The ARIMA model is explored at c(1, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ar1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 5.62 2.37 0.429 0.00123
## 2 Los Angeles CA 8.58 2.93 0.476 -0.00247
## 3 Miami FL 10.7 3.27 0.440 -0.00170
## 4 Houston TX 18.8 4.33 0.456 0.00746
## 5 Chicago IL 19.8 4.45 0.251 -0.000811
## 6 Boston MA 26.5 5.15 0.228 0.000330
## 7 New York NY 27.7 5.26 0.238 0.00741
The ARIMA model is explored at c(0, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(0, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 5
## cityName sigma2 sigma ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 5.85 2.42 0.360 0.00123
## 2 Los Angeles CA 8.98 3.00 0.414 -0.00249
## 3 Miami FL 11.0 3.32 0.382 -0.00171
## 4 Houston TX 19.3 4.39 0.415 0.00747
## 5 Chicago IL 19.9 4.46 0.245 -0.000811
## 6 Boston MA 26.5 5.14 0.232 0.000342
## 7 New York NY 27.6 5.26 0.240 0.00740
The ARIMA model is explored at c(1, 0, 1):
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(1, 0, 1))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ma1 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 5.60 2.37 0.546 -0.145 0.00119
## 2 Los Angeles CA 8.57 2.93 0.524 -0.0630 -0.00249
## 3 Miami FL 10.7 3.27 0.440 -0.000566 -0.00150
## 4 Houston TX 18.7 4.33 0.383 0.0913 0.00745
## 5 Chicago IL 19.8 4.45 0.202 0.0528 -0.000826
## 6 Boston MA 26.4 5.14 0.0834 0.153 0.000308
## 7 New York NY 27.6 5.26 0.104 0.142 0.00755
The ARIMA model is explored at c(2, 0, 0):
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(order=c(2, 0, 0))
)
names(tstARIMA) <- dewCities
tibble::tibble(cityName=names(tstARIMA),
sigma2=sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$residuals %>% var),
sigma=sqrt(sigma2)
) %>%
inner_join(sapply(names(tstARIMA), FUN=function(x) tstARIMA[[x]]$coef) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("cityName") %>%
as_tibble(),
by="cityName"
) %>%
arrange(sigma)
## # A tibble: 7 × 6
## cityName sigma2 sigma ar1 ar2 intercept
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Las Vegas NV 5.61 2.37 0.408 0.0491 0.00120
## 2 Los Angeles CA 8.57 2.93 0.465 0.0215 -0.00251
## 3 Miami FL 10.7 3.27 0.439 0.000340 -0.00174
## 4 Houston TX 18.8 4.33 0.473 -0.0374 0.00740
## 5 Chicago IL 19.8 4.45 0.254 -0.0120 -0.000809
## 6 Boston MA 26.4 5.14 0.236 -0.0344 0.000367
## 7 New York NY 27.6 5.26 0.246 -0.0338 0.00741
Sigma-squared for all cities is compiled across models:
pdqList <- list("c(0, 0, 0)"=c(0, 0, 0),
"c(1, 0, 0)"=c(1, 0, 0),
"c(0, 0, 1)"=c(0, 0, 1),
"c(1, 0, 1)"=c(1, 0, 1),
"c(2, 0, 0)"=c(2, 0, 0)
)
pdqAll <- sapply(pdqList, FUN=function(pdq) sapply(dewCities,
FUN=function(x) {
tmp<- swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
arima(x=., order=unname(pdq))
tmp$sigma2
}
)
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("ARIMA") %>%
as_tibble()
pdqAll
## # A tibble: 5 × 8
## ARIMA `Boston MA` `Chicago IL` `Houston TX` `Las Vegas NV` `Los Angeles CA`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 c(0, 0,… 27.9 21.1 23.7 6.88 11.1
## 2 c(1, 0,… 26.5 19.8 18.8 5.62 8.58
## 3 c(0, 0,… 26.4 19.9 19.3 5.85 8.97
## 4 c(1, 0,… 26.4 19.8 18.7 5.60 8.57
## 5 c(2, 0,… 26.4 19.8 18.7 5.60 8.57
## # ℹ 2 more variables: `Miami FL` <dbl>, `New York NY` <dbl>
Sigma-squared for all cities is plotted:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)"
)
The auto.arima() process is run to estimate optimal parameters for all cities:
tstARIMA <- lapply(dewCities,
FUN=function(x) swrDaily_vs_r21 %>%
arrange(date) %>%
filter(fct_city==x) %>%
pull(delta_r21) %>%
forecast::auto.arima(d=0)
)
names(tstARIMA) <- dewCities
The auto.arima() parameters are pulled for all cities:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t()
## ar ma d sig2
## Boston MA 0 1 0 26.45
## Chicago IL 0 3 0 19.81
## Houston TX 1 3 0 18.60
## Las Vegas NV 2 1 0 5.51
## Los Angeles CA 3 1 0 8.47
## Miami FL 1 0 0 10.68
## New York NY 0 2 0 27.63
Sigma-squared for all cities is updated:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)"
)
Sigma-squared for all cities is updated, incuding the baseline:
sapply(tstARIMA,
FUN=function(x) c(ar=x$arma[1], ma=x$arma[2], d=x$arma[6], sig2=round(x$sigma2,2))
) %>%
t() %>%
as.data.frame() %>%
rownames_to_column("name") %>%
select(name, value=sig2) %>%
mutate(ARIMA="auto") %>%
pivot_wider(id_cols="ARIMA") %>%
bind_rows(pdqAll) %>%
bind_rows(swrDaily_vs_r21 %>%
group_by(fct_city) %>%
summarize(value=var(sumSWR)) %>%
mutate(ARIMA="base") %>%
pivot_wider(id_cols="ARIMA", names_from="fct_city")
) %>%
pivot_longer(cols=-ARIMA, names_to="city") %>%
group_by(city) %>%
mutate(pct=value/max(value)) %>%
ungroup() %>%
ggplot(aes(x=fct_reorder(ARIMA, -pct), y=pct, fill=city)) +
facet_wrap(~city) +
geom_col() +
theme(legend.position = "none") +
geom_text(data=~filter(., pct<1), aes(y=pct/2, label=round(pct, 2)), size=3) +
labs(x="ARIMA(p, d, q)",
y="Sigma-squared\nrelative to ARIMA(0, 0, 0)",
title="Relative sigma-squared by model",
subtitle="(daily shortwave radiation vs. long-term 21-day mean for day of year)",
caption="base=daily SWR (no adjustmens)\nc(p, d, q) and auto are all vs. long-term 21-day mean"
)
ACF is calculated for shortwave radiation for each city:
dewCities <- dewPDaily %>%
pull(fct_city) %>%
as.vector() %>%
unique()
swrACF <- lapply(dewCities,
FUN=function(x) {
swrDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(sumSWR) %>%
acf(lag.max=2000, plot=FALSE)
}
)
names(swrACF) <- dewCities
swrACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(swrACF[[x]][["lag"]]),
acf=as.vector(swrACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
swrACFAll
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.640 Boston MA
## 3 1 2 0.542 Boston MA
## 4 1 3 0.533 Boston MA
## 5 1 4 0.527 Boston MA
## 6 1 5 0.524 Boston MA
## 7 1 6 0.521 Boston MA
## 8 1 7 0.530 Boston MA
## 9 1 8 0.518 Boston MA
## 10 1 9 0.514 Boston MA
## # ℹ 13,997 more rows
swrACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Las Vegas NV 0.437
## 2 Chicago IL 0.399
## 3 Los Angeles CA 0.397
## 4 Boston MA 0.270
## 5 New York NY 0.246
## 6 Miami FL 0.243
## 7 Houston TX 0.229
ACF by city are plotted:
swrACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
PACF is calculated for shortwave radiation for each city:
swrPACF <- lapply(dewCities,
FUN=function(x) {
swrDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(sumSWR) %>%
pacf(lag.max=2000, plot=FALSE)
}
)
names(swrPACF) <- dewCities
swrPACFAll <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(swrPACF[[x]][["lag"]]),
pacf=as.vector(swrPACF[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
swrPACFAll
## # A tibble: 14,000 × 4
## src lag pacf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.640 Boston MA
## 2 1 2 0.225 Boston MA
## 3 1 3 0.214 Boston MA
## 4 1 4 0.162 Boston MA
## 5 1 5 0.140 Boston MA
## 6 1 6 0.118 Boston MA
## 7 1 7 0.127 Boston MA
## 8 1 8 0.0780 Boston MA
## 9 1 9 0.0802 Boston MA
## 10 1 10 0.107 Boston MA
## # ℹ 13,990 more rows
swrPACFAll %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_pacf=mean(abs(pacf))) %>%
arrange(desc(mu_abs_pacf))
## # A tibble: 7 × 2
## cityName mu_abs_pacf
## <chr> <dbl>
## 1 Las Vegas NV 0.0114
## 2 Boston MA 0.0114
## 3 Houston TX 0.0113
## 4 New York NY 0.0112
## 5 Los Angeles CA 0.0111
## 6 Miami FL 0.0109
## 7 Chicago IL 0.00729
PACF by city are plotted:
swrPACFAll %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=pacf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
swrPACFAll %>%
filter(lag>0, lag<=50) %>%
ggplot(aes(x=lag, y=pacf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
ACF is calculated and plotted for seasonally-adjusted shortwave radiation for each city:
swrACF_r21 <- lapply(dewCities,
FUN=function(x) {
swrDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(delta_r21) %>%
acf(lag.max=2000, plot=FALSE)
}
)
names(swrACF_r21) <- dewCities
swrACFAll_r21 <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(swrACF_r21[[x]][["lag"]]),
acf=as.vector(swrACF_r21[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
swrACFAll_r21
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.228 Boston MA
## 3 1 2 0.0193 Boston MA
## 4 1 3 0.000369 Boston MA
## 5 1 4 -0.0106 Boston MA
## 6 1 5 -0.0151 Boston MA
## 7 1 6 -0.0192 Boston MA
## 8 1 7 0.00412 Boston MA
## 9 1 8 -0.0189 Boston MA
## 10 1 9 -0.0232 Boston MA
## # ℹ 13,997 more rows
swrACFAll_r21 %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Las Vegas NV 0.0196
## 2 Los Angeles CA 0.0161
## 3 Houston TX 0.0148
## 4 Miami FL 0.0124
## 5 Boston MA 0.0108
## 6 New York NY 0.0107
## 7 Chicago IL 0.00592
swrACFAll_r21 %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
swrACFAll_r21 %>%
filter(lag>0, lag<=30) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
PACF is calculated and plotted for seasonally-adjusted shortwave radiation for each city:
swrPACF_r21 <- lapply(dewCities,
FUN=function(x) {
swrDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(delta_r21) %>%
pacf(lag.max=2000, plot=FALSE)
}
)
names(swrPACF_r21) <- dewCities
swrPACFAll_r21 <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(swrPACF_r21[[x]][["lag"]]),
acf=as.vector(swrPACF_r21[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
swrPACFAll_r21
## # A tibble: 14,000 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.228 Boston MA
## 2 1 2 -0.0344 Boston MA
## 3 1 3 0.00385 Boston MA
## 4 1 4 -0.0115 Boston MA
## 5 1 5 -0.0106 Boston MA
## 6 1 6 -0.0141 Boston MA
## 7 1 7 0.0122 Boston MA
## 8 1 8 -0.0243 Boston MA
## 9 1 9 -0.0143 Boston MA
## 10 1 10 0.0255 Boston MA
## # ℹ 13,990 more rows
swrPACFAll_r21 %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Houston TX 0.0105
## 2 Las Vegas NV 0.0102
## 3 Los Angeles CA 0.0101
## 4 Miami FL 0.0100
## 5 Boston MA 0.00993
## 6 New York NY 0.00989
## 7 Chicago IL 0.00515
swrPACFAll_r21 %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
swrPACFAll_r21 %>%
filter(lag>0, lag<=30) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted shortwave radiation by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
ACF is calculated and plotted for seasonally-adjusted temperature for each city:
tempACF_r21 <- lapply(dewCities,
FUN=function(x) {
tempDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(delta_r21) %>%
acf(lag.max=2000, plot=FALSE)
}
)
names(tempACF_r21) <- dewCities
tempACFAll_r21 <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(tempACF_r21[[x]][["lag"]]),
acf=as.vector(tempACF_r21[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
tempACFAll_r21
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
tempACFAll_r21 %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
tempACFAll_r21 %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted mean temperature by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
tempACFAll_r21 %>%
filter(lag>0, lag<=30) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="ACF", title="ACF of seasonally-adjusted mean temperature by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
PACF is calculated and plotted for seasonally-adjusted temperature for each city:
tempPACF_r21 <- lapply(dewCities,
FUN=function(x) {
tempDaily_vs_r21 %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(delta_r21) %>%
pacf(lag.max=2000, plot=FALSE)
}
)
names(tempPACF_r21) <- dewCities
tempPACFAll_r21 <- map_dfr(.x=dewCities,
.f=function(x) tibble::tibble(lag=as.vector(tempPACF_r21[[x]][["lag"]]),
acf=as.vector(tempPACF_r21[[x]][["acf"]])
),
.id="src") %>%
mutate(cityName=dewCities[as.integer(src)])
tempPACFAll_r21
## # A tibble: 14,000 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.648 Boston MA
## 2 1 2 -0.195 Boston MA
## 3 1 3 0.134 Boston MA
## 4 1 4 0.0104 Boston MA
## 5 1 5 0.0248 Boston MA
## 6 1 6 0.0303 Boston MA
## 7 1 7 -0.00303 Boston MA
## 8 1 8 0.0244 Boston MA
## 9 1 9 0.000595 Boston MA
## 10 1 10 0.0201 Boston MA
## # ℹ 13,990 more rows
tempPACFAll_r21 %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0108
## 2 New York NY 0.0107
## 3 Boston MA 0.0105
## 4 Las Vegas NV 0.0104
## 5 Houston TX 0.0104
## 6 Miami FL 0.0101
## 7 Chicago IL 0.00577
tempPACFAll_r21 %>%
filter(lag>0) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted mean temperature by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
tempPACFAll_r21 %>%
filter(lag>0, lag<=30) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag (zero excluded)", y="PACF", title="PACF of seasonally-adjusted mean temperature by city") +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
The ACF plot creation process is converted to functional form:
makeACFPlot <- function(df,
cities=dewCities,
keyVar="delta_r21",
fn=acf,
lagMax=2000,
lagCuts=list(c(1, +Inf)),
plotTitle="provided data",
printSummary=TRUE,
makePlots=TRUE,
returnData=TRUE
) {
# FUNCTIOn ARGUMENTS:
# df: tibble containing data
# cities: city names for including as colors in plot
# fn: the function to calculate (should be acf or pacf)
# lagMax: the maximum lage to use in fn
# lagCuts: list of tuples for plot x-axis min/max cut points list(c(x1min, x1max), c(x2min, x2max), ...)
# plotTitle: description for the plot data, will have "fn of " prepended
# printSummary: boolean, should a summary of mean absolute fn by city be printed?
# makePlots: boolean, should plots be created?
# returnData: bollean, should dfACF be returned?
vecACF <- lapply(cities,
FUN=function(x) {
df %>%
filter(fct_city==x) %>%
arrange(date) %>%
select(all_of(keyVar)) %>%
fn(lag.max=lagMax, plot=FALSE)
}
)
names(vecACF) <- cities
dfACF <- map_dfr(.x=cities,
.f=function(x) tibble::tibble(lag=as.vector(vecACF[[x]][["lag"]]),
acf=as.vector(vecACF[[x]][["acf"]])
),
.id="src"
) %>%
mutate(cityName=cities[as.integer(src)])
if(isTRUE(printSummary)) {
dfACF %>%
filter(lag > 0) %>%
group_by(cityName) %>%
summarize(mu_abs_acf=mean(abs(acf))) %>%
arrange(desc(mu_abs_acf)) %>%
print()
}
if(isTRUE(makePlots)) {
for(i in 1:length(lagCuts)) {
xmin <- lagCuts[[i]][1]
xmax <- lagCuts[[i]][2]
p1 <- dfACF %>%
filter(lag>=xmin, lag<=xmax) %>%
ggplot(aes(x=lag, y=acf)) +
geom_line(aes(color=cityName, group=cityName), lwd=1) +
labs(x="Lag",
y=toupper(deparse(substitute(fn))),
title=paste0(toupper(deparse(substitute(fn))), " of ", plotTitle)
) +
scale_color_discrete(NULL) +
lims(y=c(-1, 1))
print(p1)
}
}
if(isTRUE(returnData)) return(dfACF)
}
The function is tested:
# Base functionality
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21")
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
# PACF functionality
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", fn=pacf)
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0108
## 2 New York NY 0.0107
## 3 Boston MA 0.0105
## 4 Las Vegas NV 0.0104
## 5 Houston TX 0.0104
## 6 Miami FL 0.0101
## 7 Chicago IL 0.00577
## # A tibble: 14,000 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 1 0.648 Boston MA
## 2 1 2 -0.195 Boston MA
## 3 1 3 0.134 Boston MA
## 4 1 4 0.0104 Boston MA
## 5 1 5 0.0248 Boston MA
## 6 1 6 0.0303 Boston MA
## 7 1 7 -0.00303 Boston MA
## 8 1 8 0.0244 Boston MA
## 9 1 9 0.000595 Boston MA
## 10 1 10 0.0201 Boston MA
## # ℹ 13,990 more rows
# Exclude elements of printing/plotting
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", printSummary=FALSE)
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", returnData=FALSE)
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", makePlots=FALSE)
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Specialized cut points
makeACFPlot(tempDaily_vs_r21, keyVar="delta_r21", lagCuts=list(c(0, +Inf), c(1, 10)))
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Customized plot titles
makeACFPlot(tempDaily_vs_r21,
keyVar="delta_r21",
lagCuts=list(c(0, +Inf), c(1, 10)),
plotTitle="seasonally-adjusted mean temperature by city"
)
## # A tibble: 7 × 2
## cityName mu_abs_acf
## <chr> <dbl>
## 1 Los Angeles CA 0.0282
## 2 Las Vegas NV 0.0232
## 3 Houston TX 0.0198
## 4 New York NY 0.0191
## 5 Boston MA 0.0174
## 6 Miami FL 0.0173
## 7 Chicago IL 0.0127
## # A tibble: 14,007 × 4
## src lag acf cityName
## <chr> <dbl> <dbl> <chr>
## 1 1 0 1 Boston MA
## 2 1 1 0.648 Boston MA
## 3 1 2 0.307 Boston MA
## 4 1 3 0.187 Boston MA
## 5 1 4 0.150 Boston MA
## 6 1 5 0.124 Boston MA
## 7 1 6 0.109 Boston MA
## 8 1 7 0.0873 Boston MA
## 9 1 8 0.0746 Boston MA
## 10 1 9 0.0637 Boston MA
## # ℹ 13,997 more rows
# Create tibble for multiple elements
map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"),
.f=function(x) makeACFPlot(tempDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE),
.id="src"
) %>%
mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
## # A tibble: 14,007 × 5
## lag cityName delta_r21 temp_r21 meanTemp
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Boston MA 1 1 1
## 2 1 Boston MA 0.648 1.00 0.940
## 3 2 Boston MA 0.307 0.999 0.882
## 4 3 Boston MA 0.187 0.998 0.861
## 5 4 Boston MA 0.150 0.996 0.854
## 6 5 Boston MA 0.124 0.994 0.848
## 7 6 Boston MA 0.109 0.992 0.844
## 8 7 Boston MA 0.0873 0.990 0.839
## 9 8 Boston MA 0.0746 0.987 0.834
## 10 9 Boston MA 0.0637 0.985 0.830
## # ℹ 13,997 more rows
Relative ACF by city for mean temperature are plotted:
# Create tibble for multiple elements
dfTempACF <- map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"),
.f=function(x) makeACFPlot(tempDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE),
.id="src"
) %>%
mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfTempACF
## # A tibble: 14,007 × 5
## lag cityName delta_r21 temp_r21 meanTemp
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Boston MA 1 1 1
## 2 1 Boston MA 0.648 1.00 0.940
## 3 2 Boston MA 0.307 0.999 0.882
## 4 3 Boston MA 0.187 0.998 0.861
## 5 4 Boston MA 0.150 0.996 0.854
## 6 5 Boston MA 0.124 0.994 0.848
## 7 6 Boston MA 0.109 0.992 0.844
## 8 7 Boston MA 0.0873 0.990 0.839
## 9 8 Boston MA 0.0746 0.987 0.834
## 10 9 Boston MA 0.0637 0.985 0.830
## # ℹ 13,997 more rows
dfTempACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="ACF", title="ACF by mean temperature metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Relative PACF by city for mean temperature are plotted:
# Create tibble for multiple elements
dfTempPACF <- map_dfr(.x=c("delta_r21", "temp_r21", "meanTemp"),
.f=function(x) makeACFPlot(tempDaily_vs_r21,
fn=pacf,
keyVar=x,
printSummary=FALSE,
makePlots=FALSE
),
.id="src"
) %>%
mutate(src=c("delta_r21", "temp_r21", "meanTemp")[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfTempPACF
## # A tibble: 14,000 × 5
## lag cityName delta_r21 temp_r21 meanTemp
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Boston MA 0.648 1.00 0.940
## 2 2 Boston MA -0.195 -0.344 -0.0166
## 3 3 Boston MA 0.134 -0.234 0.289
## 4 4 Boston MA 0.0104 -0.185 0.127
## 5 5 Boston MA 0.0248 -0.156 0.122
## 6 6 Boston MA 0.0303 -0.123 0.111
## 7 7 Boston MA -0.00303 -0.112 0.0658
## 8 8 Boston MA 0.0244 -0.0970 0.0823
## 9 9 Boston MA 0.000595 -0.0844 0.0481
## 10 10 Boston MA 0.0201 -0.0890 0.0575
## # ℹ 13,990 more rows
dfTempPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by mean temperature metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
dfTempPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
filter(lag <= 20) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by mean temperature metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Relative ACF by city for mean dewpoint are plotted:
# Create tibble for multiple elements
dfDewACF <- map_dfr(.x=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean"),
.f=function(x) makeACFPlot(dewPDaily_vs_r21, keyVar=x, printSummary=FALSE, makePlots=FALSE),
.id="src"
) %>%
mutate(src=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean")[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfDewACF
## # A tibble: 14,007 × 5
## lag cityName delta_r21 dewpoint_r21 dewpoint_2m_mean
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Boston MA 1 1 1
## 2 1 Boston MA 0.556 1.00 0.882
## 3 2 Boston MA 0.163 0.999 0.778
## 4 3 Boston MA 0.0744 0.998 0.754
## 5 4 Boston MA 0.0636 0.996 0.750
## 6 5 Boston MA 0.0631 0.994 0.749
## 7 6 Boston MA 0.0717 0.992 0.750
## 8 7 Boston MA 0.0732 0.990 0.749
## 9 8 Boston MA 0.0519 0.987 0.741
## 10 9 Boston MA 0.0205 0.984 0.731
## # ℹ 13,997 more rows
dfDewACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="ACF", title="ACF by mean dewpoint metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Relative PACF by city for mean dewpoint are plotted:
# Create tibble for multiple elements
dfDewPACF <- map_dfr(.x=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean"),
.f=function(x) makeACFPlot(dewPDaily_vs_r21,
fn=pacf,
keyVar=x,
printSummary=FALSE,
makePlots=FALSE
),
.id="src"
) %>%
mutate(src=c("delta_r21", "dewpoint_r21", "dewpoint_2m_mean")[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfDewPACF
## # A tibble: 14,000 × 5
## lag cityName delta_r21 dewpoint_r21 dewpoint_2m_mean
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Boston MA 0.556 1.00 0.882
## 2 2 Boston MA -0.211 -0.364 -0.00177
## 3 3 Boston MA 0.124 -0.220 0.306
## 4 4 Boston MA -0.0143 -0.178 0.127
## 5 5 Boston MA 0.0413 -0.162 0.158
## 6 6 Boston MA 0.0307 -0.124 0.126
## 7 7 Boston MA 0.0236 -0.114 0.103
## 8 8 Boston MA -0.00249 -0.0968 0.0677
## 9 9 Boston MA -0.0108 -0.0872 0.0515
## 10 10 Boston MA 0.0188 -0.0878 0.0707
## # ℹ 13,990 more rows
dfDewPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by mean dewpoint metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
dfDewPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
filter(lag <= 20) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by mean dewpoint metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Relative ACF by city for maximum wind speed are plotted:
keepVars <- c("delta_r21", "wind_r21", "maxWind")
useDF <- windDaily_vs_r21
# Create tibble for multiple elements
dfWindACF <- map_dfr(.x=keepVars,
.f=function(x) makeACFPlot(useDF, keyVar=x, printSummary=FALSE, makePlots=FALSE),
.id="src"
) %>%
mutate(src=keepVars[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfWindACF
## # A tibble: 14,007 × 5
## lag cityName delta_r21 wind_r21 maxWind
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Boston MA 1 1 1
## 2 1 Boston MA 0.310 0.998 0.377
## 3 2 Boston MA -0.00134 0.995 0.0951
## 4 3 Boston MA -0.0171 0.992 0.0804
## 5 4 Boston MA -0.0105 0.988 0.0855
## 6 5 Boston MA 0.00550 0.984 0.100
## 7 6 Boston MA 0.00163 0.981 0.0973
## 8 7 Boston MA -0.0000671 0.977 0.0968
## 9 8 Boston MA 0.00612 0.972 0.103
## 10 9 Boston MA 0.0000161 0.967 0.0971
## # ℹ 13,997 more rows
dfWindACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="ACF", title="ACF by mean windspeed metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Relative PACF by city for mean maximum wind speed are plotted:
keepVars <- c("delta_r21", "wind_r21", "maxWind")
useDF <- windDaily_vs_r21
# Create tibble for multiple elements
dfWindPACF <- map_dfr(.x=keepVars,
.f=function(x) makeACFPlot(useDF,
fn=pacf,
keyVar=x,
printSummary=FALSE,
makePlots=FALSE
),
.id="src"
) %>%
mutate(src=keepVars[as.integer(src)]) %>%
pivot_wider(id_cols=c("lag", "cityName"), names_from="src", values_from="acf")
dfWindACF
## # A tibble: 14,007 × 5
## lag cityName delta_r21 wind_r21 maxWind
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 0 Boston MA 1 1 1
## 2 1 Boston MA 0.310 0.998 0.377
## 3 2 Boston MA -0.00134 0.995 0.0951
## 4 3 Boston MA -0.0171 0.992 0.0804
## 5 4 Boston MA -0.0105 0.988 0.0855
## 6 5 Boston MA 0.00550 0.984 0.100
## 7 6 Boston MA 0.00163 0.981 0.0973
## 8 7 Boston MA -0.0000671 0.977 0.0968
## 9 8 Boston MA 0.00612 0.972 0.103
## 10 9 Boston MA 0.0000161 0.967 0.0971
## # ℹ 13,997 more rows
dfWindPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by maximum wind speed metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
dfWindPACF %>%
pivot_longer(cols=-c(lag, cityName)) %>%
filter(lag <= 20) %>%
ggplot(aes(x=lag, y=value)) +
geom_line(aes(color=name)) +
facet_wrap(~cityName) +
labs(x="Lag", y="PACF", title="PACF by maximum wind speed metric and city") +
scale_color_discrete("Metric") +
lims(y=c(-1, 1))
Rolling 21-day mean maximum wind speed is further explored:
windDaily_vs_r21 %>%
filter(year(date)>2010) %>%
ggplot(aes(x=date, y=wind_r21)) +
geom_line() +
geom_smooth(aes(y=maxWind), color="lightblue", method="loess", span=21/365) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Rolling 21-day maximum wind speed",
caption="Black: Mean based on city and day-of-year\nBlue: Smooth on raw data"
)
## `geom_smooth()` using formula = 'y ~ x'
dfWindPACF %>%
filter(lag>=100) %>%
mutate(abs_wind_r21=abs(wind_r21)) %>%
group_by(cityName) %>%
slice_max(abs_wind_r21, n=6) %>%
print(n=50)
## # A tibble: 42 × 6
## # Groups: cityName [7]
## lag cityName delta_r21 wind_r21 maxWind abs_wind_r21
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 366 Boston MA 0.0253 -0.402 0.0298 0.402
## 2 345 Boston MA 0.0432 0.271 0.0446 0.271
## 3 1097 Boston MA -0.0227 -0.165 -0.0184 0.165
## 4 1462 Boston MA -0.00132 -0.162 0.00106 0.162
## 5 367 Boston MA -0.0123 -0.150 -0.00806 0.150
## 6 732 Boston MA 0.0158 -0.117 0.0191 0.117
## 7 366 Chicago IL 0.00214 -0.413 0.0101 0.413
## 8 1462 Chicago IL -0.0139 -0.376 -0.0106 0.376
## 9 345 Chicago IL 0.00555 0.304 0.0150 0.304
## 10 1097 Chicago IL -0.00303 -0.204 0.00194 0.204
## 11 367 Chicago IL 0.000115 -0.182 0.00780 0.182
## 12 1441 Chicago IL -0.00604 0.174 -0.00356 0.174
## 13 366 Houston TX 0.00345 -0.369 0.00925 0.369
## 14 345 Houston TX 0.00796 0.327 0.0135 0.327
## 15 1097 Houston TX 0.0307 -0.156 0.0342 0.156
## 16 387 Houston TX 0.00977 -0.142 0.00469 0.142
## 17 344 Houston TX 0.00636 0.140 0.0118 0.140
## 18 106 Houston TX 0.00569 0.136 -0.00385 0.136
## 19 366 Las Vegas NV -0.00407 -0.338 -0.00196 0.338
## 20 345 Las Vegas NV -0.0137 0.178 -0.0124 0.178
## 21 1462 Las Vegas NV -0.0124 -0.153 -0.00699 0.153
## 22 732 Las Vegas NV -0.00980 -0.127 -0.00974 0.127
## 23 1097 Las Vegas NV -0.0113 -0.126 -0.00796 0.126
## 24 367 Las Vegas NV -0.00594 -0.123 -0.00416 0.123
## 25 366 Los Angeles CA -0.00364 -0.412 -0.00117 0.412
## 26 345 Los Angeles CA -0.000526 0.271 0.00165 0.271
## 27 1097 Los Angeles CA -0.00697 -0.160 -0.00449 0.160
## 28 387 Los Angeles CA -0.0177 -0.154 -0.0174 0.154
## 29 1462 Los Angeles CA -0.0178 -0.145 -0.0148 0.145
## 30 1463 Los Angeles CA 0.0138 0.136 0.0177 0.136
## 31 366 Miami FL -0.0122 -0.402 -0.00415 0.402
## 32 345 Miami FL -0.0000806 0.222 0.00726 0.222
## 33 1462 Miami FL -0.0185 -0.165 -0.0150 0.165
## 34 1097 Miami FL -0.00502 -0.153 -0.00124 0.153
## 35 344 Miami FL 0.000391 0.139 0.00744 0.139
## 36 1463 Miami FL -0.0120 0.115 -0.00805 0.115
## 37 366 New York NY 0.00434 -0.349 0.0175 0.349
## 38 345 New York NY 0.0299 0.318 0.0417 0.318
## 39 1097 New York NY 0.00396 -0.137 0.0104 0.137
## 40 1462 New York NY -0.00945 -0.137 -0.00441 0.137
## 41 732 New York NY -0.000116 -0.126 0.00612 0.126
## 42 346 New York NY -0.00847 0.124 0.00292 0.124
There is a consistently large PACF at lag 366. Some cities appear to have a disconnected increase in mean maximum wind speed
Mean maximum wind speed by year is further explored:
windDaily_vs_r21 %>%
filter(year(date)>2010) %>%
group_by(fct_city, year) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
ggplot(aes(x=year, y=muWind)) +
geom_line() +
geom_line(aes(y=mur21), lty=2) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Mean maximum wind speed",
caption="Solid: Mean based on city and year\nDashed: Mean based on city"
)
The impact is particularly pronounced in Chicago and Las Vegas, with some visual impact also in Houston
The impact for Las Vegas is further explored, including the disconnect in 2017:
windDaily_vs_r21 %>%
filter(year(date)>2010, fct_city=="Las Vegas NV") %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE)) %>%
ungroup() %>%
ggplot(aes(x=factor(mon), y=muWind)) +
geom_line(aes(group=1), lwd=1.5, color="lightblue") +
geom_line(aes(y=multe2016, group=1), lty=2) +
facet_wrap(~year) +
labs(title="Las Vegas NV",
x=NULL,
y="Mean maximum wind speed",
caption="Solid: Mean based on month and year\nDashed: Mean by month (2011-2016 avg)"
)
There is likely a significant change in measuring site, methodology, etc. - effective January 2017 - for mean maximum wind speed in Las Vegas
The impact for Chicago is further explored, including the disconnect in 2017:
windDaily_vs_r21 %>%
filter(year(date)>2010, fct_city=="Chicago IL") %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE)) %>%
ungroup() %>%
ggplot(aes(x=factor(mon), y=muWind)) +
geom_line(aes(group=1), lwd=1.5, color="lightblue") +
geom_line(aes(y=multe2016, group=1), lty=2) +
facet_wrap(~year) +
labs(title="Chicago IL",
x=NULL,
y="Mean maximum wind speed",
caption="Solid: Mean based on month and year\nDashed: Mean by month (2011-2016 avg)"
)
There is likely a significant change in measuring site, methodology, etc. - effective January or February 2017 - for mean maximum wind speed in Chicago
Differences by month between pre/post-2017 are explored:
windDaily_vs_r21 %>%
filter(year(date)>2010) %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(fct_city, mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE),
mugt2016=mean(ifelse(year>2016, muWind, NA), na.rm=TRUE)
) %>%
ungroup() %>%
filter(year %in% c(2016)) %>%
select(fct_city, mon, multe2016, mugt2016) %>%
pivot_longer(cols=-c(fct_city, mon)) %>%
ggplot(aes(x=factor(mon),
y=value,
group=name,
color=c("mugt2016"="2017-after", "multe2016"="2016-before")[name])
) +
geom_line() +
facet_wrap(~fct_city) +
scale_color_discrete(NULL) +
labs(title="Mean maximum wind speed by city and month\n(2011-2016 vs. 2017-2024)",
x=NULL,
y="Mean maximum wind speed"
)
Standard deviation for pre-2017 is added:
windDaily_vs_r21 %>%
filter(year(date)>2010) %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(fct_city, mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE),
sdlte2016=sd(ifelse(year<=2016, muWind, NA), na.rm=TRUE),
mugt2016=mean(ifelse(year>2016, muWind, NA), na.rm=TRUE)
) %>%
ungroup() %>%
filter(year %in% c(2016)) %>%
select(fct_city, mon, multe2016, mugt2016, sdlte2016) %>%
pivot_longer(cols=-c(fct_city, mon, sdlte2016)) %>%
ggplot(aes(x=factor(mon),
y=value,
group=name,
color=c("mugt2016"="2017-after", "multe2016"="2016-before")[name])
) +
geom_line() +
geom_ribbon(data=~filter(., name=="multe2016"),
aes(ymin=value-sdlte2016, ymax=value+sdlte2016),
alpha=0.25,
lty=0,
fill="red"
) +
facet_wrap(~fct_city) +
scale_color_discrete(NULL) +
labs(title="Mean maximum wind speed by city and month\n(2011-2016 vs. 2017-2024)",
x=NULL,
y="Mean maximum wind speed",
caption="Ribbon for 2016-before is mean +/- 1 sd"
)
The disconnect in Las Vegas is further explored:
windDaily_vs_r21 %>%
filter(year(date)>2010, fct_city=="Las Vegas NV") %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE),
lbl=ifelse(year<=2016, "2011-2016", "2017-20247")
) %>%
ungroup() %>%
ggplot(aes(x=factor(year), y=muWind)) +
geom_col(aes(fill=lbl), alpha=0.5) +
geom_line(aes(y=multe2016, group=1), lty=2) +
facet_wrap(~mon) +
labs(title="Las Vegas NV",
x=NULL,
y="Mean maximum wind speed",
caption="Dotted line is monthly mean pre-2017"
) +
theme(axis.text.x=element_text(angle=90), legend.position="bottom") +
scale_fill_discrete(NULL)
The disconnect in Chicago is further explored:
windDaily_vs_r21 %>%
filter(year(date)>2010, fct_city=="Chicago IL") %>%
mutate(mon=factor(month.abb[month(date)], levels=month.abb)) %>%
group_by(fct_city, year, mon) %>%
summarize(muWind=mean(maxWind), mur21=mean(wind_r21), mudelta=mean(delta_r21), .groups="drop") %>%
group_by(mon) %>%
mutate(multe2016=mean(ifelse(year<=2016, muWind, NA), na.rm=TRUE),
lbl=ifelse(year<=2016, "2011-2016", "2017-20247")
) %>%
ungroup() %>%
ggplot(aes(x=factor(year), y=muWind)) +
geom_col(aes(fill=lbl), alpha=0.5) +
geom_line(aes(y=multe2016, group=1), lty=2) +
facet_wrap(~mon) +
labs(title="Chicago IL",
x=NULL,
y="Mean maximum wind speed",
caption="Dotted line is monthly mean pre-2017"
) +
theme(axis.text.x=element_text(angle=90), legend.position="bottom") +
scale_fill_discrete(NULL)
The pre/post-2017 jump in mean maximum wind speed is much sharper in Las Vegas than in Chcago
Mean temperature by year is explored:
tempDaily_vs_r21 %>%
filter(year(date)>2010, year(date)<2023) %>%
group_by(fct_city, year) %>%
summarize(muTemp=mean(meanTemp), mur21=mean(temp_r21), mudelta=mean(delta_r21), .groups="drop") %>%
ggplot(aes(x=year, y=muTemp)) +
geom_line() +
geom_line(aes(y=mur21), lty=2) +
facet_wrap(~fct_city) +
labs(x=NULL,
y="Mean temperature",
caption="Solid: Mean based on city and year\nDashed: Mean based on city"
)
Temperature by year appears much more stable, with no obvious disconnect in 2017
The process is converted to functional form:
tmpMetricByYear <- function(df, yearMin=-Inf, yearMax=+Inf, mainVar, dashVar, labTitle) {
p1 <- df %>%
filter(year(date)>=yearMin, year(date)<=yearMax) %>%
group_by(fct_city, year) %>%
summarize(mv=mean(get(mainVar)), dv=mean(get(dashVar)), .groups="drop") %>%
ggplot(aes(x=year, y=mv)) +
geom_line() +
geom_line(aes(y=dv), lty=2) +
facet_wrap(~fct_city) +
labs(x=NULL,
y=labTitle,
caption="Solid: Mean based on city and year\nDashed: Mean based on city"
)
p1
}